Abstract

Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding
this problem focus on statistical inference and generalization of individual analysis procedures.
Yet the practice of data analysis is an inherently interactive and adaptive process: new analyses and
hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of
obtained results, and datasets are shared and reused. An investigation of this gap has recently been
initiated by the authors in [DFH+14], where we focused on the problem of estimating expectations of
adaptively chosen functions.

In this paper, we give a simple and practical method for reusing a holdout (or testing) set to validate 
the accuracy of hypotheses produced by a learning algorithm operating on a training set. Reusing a 
holdout set adaptively multiple times can easily lead to overfitting to the holdout set itself. We give an 
algorithm that enables the validation of a large number of adaptively chosen hypotheses, while provably 
avoiding overfitting. We illustrate the advantages of our algorithm over the standard use of the holdout 
set via a simple synthetic experiment. 

We also formalize and address the general problem of data reuse in adaptive data analysis. We
show how the differential-privacy based approach given in [DFH+14] is applicable much more broadly
to adaptive data analysis. We then show that a simple approach based on description length can also 
be used to give guarantees of statistical validity in adaptive settings. Finally, we demonstrate that 
these incomparable approaches can be unified via the notion of approximate max-information that we 
introduce. This, in particular, allows the preservation of statistical validity guarantees even when an 
analyst adaptively composes algorithms which have guarantees based on either of the two approaches. 


1 Introduction 

The goal of machine learning is to produce hypotheses or models that generalize well to the unseen instances 
of the problem. More generally, statistical data analysis is concerned with estimating properties of the 
underlying data distribution, rather than properties that are specific to the finite data set at hand. Indeed, a 
large body of theoretical and empirical research was developed for ensuring generalization in a variety of 
settings. In this work, it is commonly assumed that each analysis procedure (such as a learning algorithm) 
operates on a freshly sampled dataset - or if not, is validated on a freshly sampled holdout (or testing) set. 

Unfortunately, learning and inference can be more difficult in practice, where data samples are often 
reused. For example, a common practice is to perform feature selection on a dataset, and then use the 
features for some supervised learning task. When these two steps are performed on the same dataset, it is no
longer clear that the results obtained from the combined algorithm will generalize. Although not usually
understood in these terms, “Freedman’s paradox” is an elegant demonstration of the powerful (negative) effect
of adaptive analysis on the same data [Fre83]. In Freedman’s simulation, variables with a significant t-statistic
are selected and linear regression is performed on this adaptively chosen subset of variables, with famously 
misleading results: when the relationship between the dependent and explanatory variables is non-existent, 
the procedure overfits, erroneously declaring significant relationships. 
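The effect can be reproduced in a few lines. The following Python sketch is a classification analogue of this kind of experiment, not Freedman's original regression setup; the function name, the sizes `n`, `d`, `k`, and the sign-vote classifier are all illustrative assumptions. Labels are independent of every feature, yet selecting features and measuring accuracy on the same data reports accuracy well above chance.

```python
import numpy as np


def freedman_style_demo(n=200, d=2000, k=20, seed=0):
    """Select features and evaluate on the SAME pure-noise data, then on fresh data."""
    rng = np.random.default_rng(seed)
    # Pure noise: labels are independent of every feature.
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)

    # Adaptive step: keep the k features most correlated with y on this dataset.
    corr = X.T @ y / n
    selected = np.argsort(-np.abs(corr))[:k]
    weights = np.sign(corr[selected])

    def accuracy(features, labels):
        # Simple sign-vote classifier over the selected features.
        scores = features[:, selected] @ weights
        return float(np.mean(np.sign(scores) == labels))

    acc_same_data = accuracy(X, y)

    # Fresh sample from the same distribution: the true accuracy is 1/2.
    X_new = rng.standard_normal((n, d))
    y_new = rng.choice([-1.0, 1.0], size=n)
    acc_fresh_data = accuracy(X_new, y_new)
    return acc_same_data, acc_fresh_data
```

Calling `freedman_style_demo()` returns the accuracy on the reused data and on a fresh sample; the gap between the two is the overfitting introduced by the adaptive selection step.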

Most of machine learning practice does not rely on formal guarantees of generalization for learning 
algorithms. Instead a dataset is split randomly into two (or sometimes more) parts: the training set and the 
testing, or holdout, set. The training set is used for learning a predictor, and then the holdout set is used 
to estimate the accuracy of the predictor on the true distribution.^1 Because the predictor is independent
of the holdout dataset, such an estimate is a valid estimate of the true prediction accuracy (formally, this
allows one to construct a confidence interval for the prediction accuracy on the data distribution). However,
in practice the holdout dataset is rarely used only once, and as a result the predictor may not be independent
of the holdout set, resulting in overfitting to the holdout set [Reu03, RF08, CT10]. One well-known reason
for such dependence is that the holdout data is used to test a large number of predictors and only the best
one is reported. If the set of all tested hypotheses is known and independent of the holdout set, then it is
easy to account for such multiple testing or use the more sophisticated approach of Ng [Ng97].

However, such static approaches do not apply if the estimates or hypotheses tested on the holdout are
chosen adaptively: that is, if the choice of hypotheses depends on previous analyses performed on the
dataset. One prominent example in which a holdout set is often adaptively reused is hyperparameter tuning
(e.g., [DFN07]). Similarly, the holdout set in a machine learning competition, such as the famous ImageNet
competition, is typically reused many times adaptively. Other examples include using the holdout set for
feature selection, generation of base learners (in aggregation techniques such as boosting and bagging),
checking a stopping condition, and analyst-in-the-loop decisions. See [Lan05] for a discussion of several subtle
causes of overfitting. 

The concrete practical problem we address is how to ensure that the holdout set can be reused to perform 
validation in the adaptive setting. Towards addressing this problem we also ask the more general question 
of how one can ensure that the final output of adaptive data analysis generalizes to the underlying data 
distribution. This line of research was recently initiated by the authors in [DFH+14], where we focused on
the case of estimating expectations of functions from i.i.d. samples (these are also referred to as statistical
queries). There we show how to answer a large number of adaptively chosen statistical queries using techniques
from differential privacy [DMNS06] (see Sec. 1.3 and Sec. 2.2 for more details).


1.1 Our Results 

We propose a simple and general formulation of the problem of preserving statistical validity in adaptive 
data analysis. We show that the connection between differentially private algorithms and generalization from 
[DFH+14] can be extended to this more general setting, and show that similar (but sometimes incomparable)
guarantees can be obtained from algorithms whose outputs can be described by short strings. We then 
define a new notion, approximate max-information, that unifies these two basic techniques and gives a new 
perspective on the problem. In particular, we give an adaptive composition theorem for max-information, 
which gives a simple way to obtain generalization guarantees for analyses in which some of the procedures 
are differentially private and some have short description length outputs. We apply our techniques to the 
problem of reusing the holdout set for validation in the adaptive setting. 


1.1.1 A Reusable Holdout 

We describe a simple and general method, together with two specific instantiations, for reusing a holdout
set for validating results while provably avoiding overfitting to the holdout set. The analyst can perform
any analysis on the training dataset, but can only access the holdout set via an algorithm that allows the
analyst to validate her hypotheses against the holdout set. Crucially, our algorithm prevents overfitting to the
holdout set even when the analyst’s hypotheses are chosen adaptively on the basis of the previous responses
of our algorithm.

^1 Additional averaging over different partitions is used in cross-validation.

Our first algorithm, referred to as Thresholdout, derives its guarantees from differential privacy and the
results in [DFH+14, NS15]. For any function $\phi : \mathcal{X} \to [0,1]$ given by the analyst, Thresholdout uses the
holdout set to validate that $\phi$ does not overfit to the training set, that is, it checks that the mean value of $\phi$
evaluated on the training set is close to the mean value of $\phi$ evaluated on the distribution $\mathcal{P}$ from which the
data was sampled. The standard approach to such validation would be to compute the mean value of $\phi$ on the
holdout set. The use of the holdout set in Thresholdout differs from the standard use in that it exposes very
little information about the mean of $\phi$ on the holdout set: if $\phi$ does not overfit to the training set, then the
analyst receives only the confirmation of closeness, that is, just a single bit. On the other hand, if $\phi$ overfits
then Thresholdout returns the mean value of $\phi$ on the holdout set perturbed by carefully calibrated noise.

Using results from [DFH+14, NS15] we show that for datasets consisting of i.i.d. samples these modifications
provably prevent the analyst from constructing functions that overfit to the holdout set. This ensures 
correctness of Thresholdout’s responses. Naturally, the specific guarantees depend on the number of samples 
n in the holdout set. The number of queries that Thresholdout can answer is exponential in n as long as the 
number of times that the analyst overfits is at most quadratic in n. 
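A minimal Python sketch of a single Thresholdout query may help fix ideas. It assumes, as in the usual statement of the algorithm, that the overfitting branch reports the holdout mean plus noise; the function name, the Laplace noise, and the default `threshold` and `sigma` values are illustrative assumptions, and the budget on the number of overfitting queries and the re-randomization of the threshold are omitted.

```python
import numpy as np


def thresholdout_query(train_values, holdout_values, threshold=0.04,
                       sigma=0.01, rng=None):
    """Answer one query phi, given phi evaluated on each training and holdout point.

    Simplified sketch: no overfitting budget and no threshold refresh.
    """
    rng = np.random.default_rng() if rng is None else rng
    train_mean = float(np.mean(train_values))
    holdout_mean = float(np.mean(holdout_values))
    # Noisy comparison: when the means agree, only this one-bit outcome
    # depends on the holdout set.
    if abs(holdout_mean - train_mean) > threshold + rng.laplace(scale=4 * sigma):
        # phi overfits the training set: release a noisy holdout estimate.
        return holdout_mean + rng.laplace(scale=sigma)
    # phi does not overfit: the training mean is already a good answer.
    return train_mean
```

The analyst passes the values of $\phi$ on each dataset and never sees the holdout values directly; when the two means agree, the returned number is one the analyst could have computed from the training set alone.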

Our second algorithm, SparseValidate, is based on the idea that if most of the time the analyst’s procedures
generate results that do not overfit, then validating them against the holdout set does not reveal much
information about the holdout set. Specifically, the generalization guarantees of this method follow from
the observation that the transcript of the interaction between a data analyst and the holdout set can be
described concisely. More formally, this method allows the analyst to pick any Boolean function of a dataset
$\psi$ (described by an algorithm) and receive back its value on the holdout set. A simple example of such a
function would be whether the accuracy of a predictor on the holdout set is at least a certain value $\alpha$. (Unlike
in the case of Thresholdout, here there is no need to assume that the function that measures the accuracy
has a bounded range or is even Lipschitz, making it qualitatively different from the kinds of results achievable
subject to differential privacy.) A more involved example of validation would be to run an algorithm on
the holdout dataset to select a hypothesis and check if the hypothesis is similar to that obtained on the
training set (for any desired notion of similarity). Such validation can be applied to other results of analysis;
for example, one could check if the variables selected on the holdout set have large overlap with those selected
on the training set. An instantiation of the SparseValidate algorithm has already been applied to the problem
of answering statistical (and more general) queries in the adaptive setting [BSSU15]. We describe the formal
guarantees for SparseValidate in a later section.

In a later section we describe a simple experiment on synthetic data that illustrates the danger of reusing
a standard holdout set, and how this issue can be resolved by our reusable holdout. The design of this
experiment is inspired by Freedman’s classical experiment, which demonstrated the dangers of performing
variable selection and regression on the same data [Fre83].

1.2 Generalization in Adaptive Data Analysis 

We view adaptive analysis on the same dataset as an execution of a sequence of steps $\mathcal{A}_1 \to \mathcal{A}_2 \to \cdots \to \mathcal{A}_m$.
Each step is described by an algorithm $\mathcal{A}_i$ that takes as input a fixed dataset $S = (x_1, \ldots, x_n)$ drawn from
some distribution $\mathcal{D}$ over $\mathcal{X}^n$, which remains unchanged over the course of the analysis. Each algorithm $\mathcal{A}_i$
also takes as input the outputs of the previously run algorithms $\mathcal{A}_1$ through $\mathcal{A}_{i-1}$ and produces a value in
some range $\mathcal{Y}_i$. The dependence on previous outputs represents all the adaptive choices that are made at step
$i$ of data analysis. For example, depending on the previous outputs, $\mathcal{A}_i$ can run different types of analysis on
$S$. We note that at this level of generality, the algorithms can represent the choices of the data analyst, and
need not be explicitly specified. We assume that the analyst uses algorithms which individually are known to
generalize when executed on a fresh dataset sampled independently from a distribution $\mathcal{D}$. We formalize
this by assuming that for every fixed value $y_1, \ldots, y_{i-1} \in \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_{i-1}$, with probability at least $1 - \beta_i$
over the choice of $S$ according to distribution $\mathcal{D}$, the output of $\mathcal{A}_i$ on inputs $y_1, \ldots, y_{i-1}$ and $S$ has a desired
property relative to the data distribution $\mathcal{D}$ (for example, has low generalization error). Note that in this
assumption $y_1, \ldots, y_{i-1}$ are fixed and independent of the choice of $S$, whereas the analyst will execute $\mathcal{A}_i$ on
values $Y_1, \ldots, Y_{i-1}$, where $Y_j = \mathcal{A}_j(S, Y_1, \ldots, Y_{j-1})$. In other words, in the adaptive setup, the algorithm
$\mathcal{A}_i$ can depend on the previous outputs, which depend on $S$, and thus the set $S$ given to $\mathcal{A}_i$ is no longer
an independently sampled dataset. Such dependence invalidates the generalization guarantees of individual
procedures, potentially leading to overfitting.


Differential privacy: First, we spell out how the differential privacy based approach from [DFH+14] can
be applied to this more general setting. Specifically, a simple corollary of results in [DFH+14] is that for a
dataset consisting of i.i.d. samples any output of a differentially private algorithm can be used in subsequent
analysis while controlling the risk of overfitting, even beyond the setting of statistical queries studied in
[DFH+14]. A key property of differential privacy in this context is that it composes adaptively: namely, if each
of the algorithms used by the analyst is differentially private, then the whole procedure will be differentially
private (albeit with worse privacy parameters). Therefore, one way to avoid overfitting in the adaptive setting
is to use algorithms that satisfy (sufficiently strong) guarantees of differential privacy. In Section 2.2 we
describe this result formally.


Description length: We then show how description length bounds can be applied in the context of
guaranteeing generalization in the presence of adaptivity. If the outputs of algorithms
$\mathcal{A}_1, \ldots, \mathcal{A}_{i-1}$ can be described with a total of $k$ bits then there are at most $2^k$ possible values of the input $y_1, \ldots, y_{i-1}$
to $\mathcal{A}_i$. For each of these individual inputs $\mathcal{A}_i$ generalizes with probability $1 - \beta_i$. Taking a union bound over
failure probabilities implies generalization with probability at least $1 - 2^k \beta_i$. Occam’s Razor famously implies
that shorter hypotheses have lower generalization error. Our observation is that shorter hypotheses (and the
results of analysis more generally) are also better in the adaptive setting since they reveal less about the
dataset and lead to better generalization of subsequent analyses. Note that this result makes no assumptions
about the data distribution $\mathcal{D}$. We provide the formal details in Section 2.3. Later in the paper we also show that
description length-based analysis suffices for obtaining an algorithm (albeit not an efficient one) that can
answer an exponentially large number of adaptively chosen statistical queries. This provides an alternative
proof for one of the results in [DFH+14].


Approximate max-information: Our main technical contribution is the introduction and analysis of
a new information-theoretic measure, which unifies the generalization arguments that come from both
differential privacy and description length, and that quantifies how much information has been learned about
the data by the analyst. Formally, for jointly distributed random variables $(\mathbf{S}, \mathbf{Y})$, the max-information is
the maximum of the logarithm of the factor by which uncertainty about $\mathbf{S}$ is reduced given the value of
$\mathbf{Y}$, namely $I_\infty(\mathbf{S}; \mathbf{Y}) = \log \max \frac{\mathbb{P}[\mathbf{S} = S \mid \mathbf{Y} = y]}{\mathbb{P}[\mathbf{S} = S]}$, where the maximum is taken over all $S$ in the support of $\mathbf{S}$
and $y$ in the support of $\mathbf{Y}$. Informally, $\beta$-approximate max-information requires that the logarithm above be
bounded with probability at least $1 - \beta$ over the choice of $(\mathbf{S}, \mathbf{Y})$ (the actual definition is slightly weaker, see
Definition 10 for details). In our use, $\mathbf{S}$ denotes a dataset drawn randomly from the distribution $\mathcal{D}$ and $\mathbf{Y}$
denotes the output of a (possibly randomized) algorithm on $\mathbf{S}$. We prove that approximate max-information
has the following properties:


• An upper bound on (approximate) max-information gives generalization guarantees. 

• Differentially private algorithms have low max-information for any distribution $\mathcal{D}$ over datasets. A
stronger bound holds for approximate max-information on i.i.d. datasets. These bounds apply only to
so-called pure differential privacy (the $\delta = 0$ case).

• Bounds on the description length of the output of an algorithm give bounds on the approximate
max-information of the algorithm for any $\mathcal{D}$.

• Approximate max-information composes adaptively. 

• Approximate max-information is preserved under post-processing.

Composition properties of approximate max-information imply that one can easily obtain generalization
guarantees for adaptive sequences of algorithms, some of which are differentially private, and others of which
have outputs with short description length. These properties also imply that differential privacy can be used
to control generalization for any distribution $\mathcal{D}$ over datasets, which extends its generalization guarantees
beyond the restriction to datasets drawn i.i.d. from a fixed distribution, as in [DFH+14].
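To make the (exact) max-information quantity concrete, the following Python sketch evaluates it for a small joint distribution given as a probability table. The function name `max_information` and the table representation are illustrative assumptions; the computation takes the maximum of the ratio between the joint probability and the product of the marginals over pairs in the supports of the two marginals, with binary logarithm.

```python
import numpy as np


def max_information(joint):
    """Max-information (in bits) of a joint pmf given as a 2-D table joint[x, y]."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal of X, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal of Y, shape (1, ny)
    # Consider every x in the support of X and y in the support of Y.
    mask = (px > 0) & (py > 0)
    # P[X=x, Y=y] / (P[X=x] P[Y=y]) equals P[X=x | Y=y] / P[X=x].
    ratio = joint[mask] / (px * py)[mask]
    return float(np.log2(ratio.max()))
```

For an output that reveals a uniformly random bit of the data completely (the table `[[0.5, 0], [0, 0.5]]`) the value is one bit, and for independent variables it is zero.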

We remark that (pure) differential privacy and description length are otherwise incomparable: low
description length is not a sufficient condition for differential privacy, since differential privacy precludes
revealing even a small number of bits of information about any single individual in the data set. At the same
time, differential privacy does not constrain the description length of the output. Bounds on max-information
or differential privacy of an algorithm can, however, be translated to bounds on randomized description length
for a different algorithm with statistically indistinguishable output. Here we say that a randomized algorithm
has randomized description length of $k$ if for every fixing of the algorithm’s random bits, it has description
length of $k$. Details of these results and additional discussion appear in Sections 3 and A.

1.3 Related Work 

This work builds on [DFH+14], where we initiated the formal study of adaptivity in data analysis. The
primary focus of [DFH+14] is the problem of answering adaptively chosen statistical queries. The main
technique is a strong connection between differential privacy and generalization: differential privacy guarantees
that the distribution of outputs does not depend too much on any one of the data samples, and thus,
differential privacy gives a strong stability guarantee that behaves well under adaptive data analysis. The
link between generalization and approximate differential privacy made in [DFH+14] has been subsequently
strengthened, both qualitatively, by [BSSU15], who make the connection for a broader range of queries,
and quantitatively, by [NS15] and [BSSU15], who give tighter quantitative bounds. These papers, among
other results, give methods for accurately answering exponentially (in the dataset size) many adaptively
chosen queries, but the algorithms for this task are not efficient. It turns out this is for fundamental reasons:
Hardt and Ullman [HU14] and Steinke and Ullman [SU14] prove that, under cryptographic assumptions,
no efficient algorithm can answer more than quadratically many statistical queries chosen adaptively by an
adversary who knows the true data distribution.

Differential privacy emerged from a line of work [DN03, DN04, BDMN05], culminating in the definition
given by [DMNS06]. There is a very large body of work designing differentially private algorithms for various
data analysis tasks, some of which we leverage in our applications. See [Dwo11] for a short survey and [DR14]
for a textbook introduction to differential privacy.

The classical approach in theoretical machine learning to ensure that empirical estimates generalize to 
the underlying distribution is based on the various notions of complexity of the set of functions output by 
the algorithm, most notably the VC dimension (see, e.g., [SSBD14] for a textbook introduction). If one has a
sample of data large enough to guarantee generalization for all functions in some class of bounded complexity, 
then it does not matter whether the data analyst chooses functions in this class adaptively or non-adaptively. 
Our goal, in contrast, is to prove generalization bounds without making any assumptions about the class 
from which the analyst can choose query functions. In this case the adaptive setting is very different from 
the non-adaptive setting. 

An important line of work [BE02, MNPR06, PRMN04, SSSSS10] establishes connections between the
stability of a learning algorithm and its ability to generalize. Stability is a measure of how much the output 
of a learning algorithm is perturbed by changes to its input. It is known that certain stability notions are 
necessary and sufficient for generalization. Unfortunately, the stability notions considered in these prior 
works do not compose in the sense that running multiple stable algorithms sequentially and adaptively may 
result in a procedure that is not stable. The measure we introduce in this work (max-information), like
differential privacy, has the strength that it enjoys adaptive composition guarantees. This makes it amenable
to reasoning about the generalization properties of adaptively applied sequences of algorithms, while having
to analyze only the individual components of these algorithms. Connections between stability, empirical risk
minimization and differential privacy in the context of learnability have been recently explored in [WLF15].

Freund gives an approach to obtaining data-dependent generalization bounds that takes into account the
set of statistical queries that a given learning algorithm can produce for the distribution from which the
data was sampled [Fre98]. A related approach of Langford and Blum also yields data-dependent
generalization bounds based on the description length of functions that can be output for a data distribution
[LB03]. Unlike our work, these approaches require the knowledge of the structure of the learning algorithm
to derive a generalization bound. More importantly, the focus of our framework is on the design of new 
algorithms with better generalization properties in the adaptive setting. 

Finally, inspired by our work, Blum and Hardt [BH15] showed how to reuse the holdout set to maintain
an accurate leaderboard in a machine learning competition that allows the participants to submit adaptively 
chosen models in the process of the competition (such as those organized by Kaggle Inc.). Their analysis also 
relies on the description length-based technique we used to analyze SparseValidate. 


2 Preliminaries and Basic Techniques 


In the discussion below $\log$ refers to the binary logarithm and $\ln$ refers to the natural logarithm. For simplicity
we restrict our random variables to finite domains (extension of the claims to continuous domains is
straightforward using the standard formalism). For two random variables $X$ and $Y$ over the same domain $\mathcal{X}$
the max-divergence of $X$ from $Y$ is defined as

$$D_\infty(X \| Y) = \log \max_{x \in \mathcal{X}} \frac{\mathbb{P}[X = x]}{\mathbb{P}[Y = x]}.$$

$\delta$-approximate max-divergence is defined as

$$D_\infty^\delta(X \| Y) = \log \max_{O \subseteq \mathcal{X},\ \mathbb{P}[X \in O] > \delta} \frac{\mathbb{P}[X \in O] - \delta}{\mathbb{P}[Y \in O]}.$$


We say that a real-valued function over datasets $f : \mathcal{X}^n \to \mathbb{R}$ has sensitivity $c$ if for all $i \in [n]$ and
$x_1, x_2, \ldots, x_n, x_i' \in \mathcal{X}$, $f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n) \le c$. We review McDiarmid’s concentration
inequality for functions of low sensitivity.


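Lemma 1 (McDiarmid’s inequality). Let $X_1, X_2, \ldots, X_n$ be independent random variables taking values
in the set $\mathcal{X}$. Further let $f : \mathcal{X}^n \to \mathbb{R}$ be a function of sensitivity $c > 0$. Then for all $\alpha > 0$, and
$\mu = \mathbb{E}[f(X_1, \ldots, X_n)]$,

$$\mathbb{P}\left[f(X_1, \ldots, X_n) - \mu \ge \alpha\right] \le \exp\left(\frac{-2\alpha^2}{n \cdot c^2}\right).$$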


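For a function $\phi : \mathcal{X} \to \mathbb{R}$ and a dataset $S = (x_1, \ldots, x_n)$, let $\mathcal{E}_S[\phi] = \frac{1}{n} \sum_{i=1}^n \phi(x_i)$. Note that if the
range of $\phi$ is in some interval of length $\alpha$ then $f(S) = \mathcal{E}_S[\phi]$ has sensitivity $\alpha/n$. For a distribution $\mathcal{P}$ over $\mathcal{X}$
and a function $\phi : \mathcal{X} \to \mathbb{R}$, let $\mathcal{P}[\phi] = \mathbb{E}_{x \sim \mathcal{P}}[\phi(x)]$.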
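As a quick worked instance of how McDiarmid's inequality is used with this notation (an illustration added here, writing $\mathcal{E}_S[\phi]$ for the empirical mean of $\phi$ on $S$ and $\mathcal{P}[\phi]$ for its expectation): take $\phi$ with range $[0,1]$ and $S$ consisting of $n$ i.i.d. samples from $\mathcal{P}$. The empirical mean then has sensitivity $c = 1/n$ and expectation $\mathcal{P}[\phi]$, and the lemma specializes to the familiar Hoeffding-type bound.

```latex
\[
  \mathbb{P}\bigl[\mathcal{E}_S[\phi] - \mathcal{P}[\phi] \ge \alpha\bigr]
  \;\le\; \exp\!\left(\frac{-2\alpha^2}{n \cdot (1/n)^2}\right)
  \;=\; \exp\!\left(-2\alpha^2 n\right).
\]
```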


2.1 Differential Privacy 

On an intuitive level, differential privacy hides the data of any single individual. We are thus interested in 
pairs of datasets S, S' that differ in a single element, in which case we say S and S' are adjacent. 

Definition 2. [DMNS06, DKM+06] A randomized algorithm $\mathcal{A}$ with domain $\mathcal{X}^n$ for $n > 0$ is $(\varepsilon, \delta)$-differentially
private if for all pairs of datasets $S, S' \in \mathcal{X}^n$ that differ in a single element, $D_\infty^\delta(\mathcal{A}(S) \| \mathcal{A}(S')) \le \log(e^\varepsilon)$.
The case when $\delta = 0$ is sometimes referred to as pure differential privacy, and in this case we may
say simply that $\mathcal{A}$ is $\varepsilon$-differentially private.

Differential privacy is preserved under adaptive composition. Adaptive composition of algorithms is 
a sequential execution of algorithms on the same dataset in which an algorithm at step i can depend on 
the outputs of previous algorithms. More formally, let $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_m$ be a sequence of algorithms. Each
algorithm $\mathcal{A}_i$ outputs a value in some range $\mathcal{Y}_i$ and takes as input a dataset in $\mathcal{X}^n$ as well as a value in
$\bar{\mathcal{Y}}_{i-1} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_{i-1}$. Adaptive composition of these algorithms is the algorithm that takes as input
a dataset $S \in \mathcal{X}^n$ and executes $\mathcal{A}_1 \to \mathcal{A}_2 \to \cdots \to \mathcal{A}_m$ sequentially with the input to $\mathcal{A}_i$ being $S$ and the
outputs $y_1, \ldots, y_{i-1}$ of $\mathcal{A}_1, \ldots, \mathcal{A}_{i-1}$. Such composition captures the common practice in data analysis of
using the outcomes of previous analyses (that is, $y_1, \ldots, y_{i-1}$) to select an algorithm that is executed on $S$.
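The definition of adaptive composition can be sketched directly in code. In the following Python illustration, each step receives the same dataset together with the tuple of all earlier outputs; the names `adaptive_composition`, `pick_statistic`, and `compute_statistic` are hypothetical and chosen only for this example.

```python
from typing import Any, Callable, Sequence, Tuple

# A step maps (dataset, outputs of all previous steps) to its own output.
Step = Callable[[Sequence[Any], Tuple[Any, ...]], Any]


def adaptive_composition(dataset: Sequence[Any], steps: Sequence[Step]) -> Tuple[Any, ...]:
    """Run the steps in order on the same dataset, feeding each the earlier outputs."""
    outputs: Tuple[Any, ...] = ()
    for step in steps:
        outputs = outputs + (step(dataset, outputs),)
    return outputs


def pick_statistic(data, previous):
    # First analysis: decide what to look at next, based on the data itself.
    return "max" if max(data) > 10 else "mean"


def compute_statistic(data, previous):
    # Second analysis: its behaviour depends on the first step's output.
    return max(data) if previous[0] == "max" else sum(data) / len(data)
```

Because `compute_statistic` is selected by an output that itself depends on the data, its guarantee for a fixed choice of statistic no longer applies automatically, which is exactly the dependence the composition results below are designed to control.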

For an algorithm that in addition to a dataset has other input we say that it is $(\varepsilon, \delta)$-differentially private
if it is $(\varepsilon, \delta)$-differentially private for every setting of additional parameters. The basic property of adaptive
composition of differentially private algorithms is the following result (e.g., [DL09]):

Theorem 3. Let $\mathcal{A}_i : \mathcal{X}^n \times \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_{i-1} \to \mathcal{Y}_i$ be an $(\varepsilon_i, \delta_i)$-differentially private algorithm for $i \in [m]$.
Then the algorithm $\mathcal{B} : \mathcal{X}^n \to \mathcal{Y}_m$ obtained by composing the $\mathcal{A}_i$’s adaptively is $\left(\sum_{i=1}^m \varepsilon_i, \sum_{i=1}^m \delta_i\right)$-differentially
private.

A more sophisticated argument yields significant improvement when $\varepsilon < 1$ (e.g., [DR14]).

Theorem 4. For all $\varepsilon, \delta, \delta' > 0$, the adaptive composition of $m$ arbitrary $(\varepsilon, \delta)$-differentially private algorithms
is $(\varepsilon', m\delta + \delta')$-differentially private, where

$$\varepsilon' = \sqrt{2m \ln(1/\delta')} \cdot \varepsilon + m\varepsilon(e^\varepsilon - 1).$$
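To see how much the second composition bound can help, the two formulas can be evaluated side by side. The Python helpers below are a small sketch with hypothetical names; they only transcribe the basic bound (for $m$ algorithms with identical parameters) and the advanced bound stated above.

```python
import math


def basic_composition(eps, delta, m):
    """Parameters after composing m (eps, delta)-DP algorithms by simple summation."""
    return m * eps, m * delta


def advanced_composition(eps, delta, m, delta_prime):
    """Parameters (eps', m*delta + delta') from the advanced composition bound."""
    eps_prime = (math.sqrt(2 * m * math.log(1 / delta_prime)) * eps
                 + m * eps * (math.exp(eps) - 1))
    return eps_prime, m * delta + delta_prime
```

For example, with `eps = 0.01` and `m = 1000` the basic bound gives a total privacy parameter of 10, while the advanced bound with `delta_prime = 1e-6` stays below 2, at the price of the extra additive `delta_prime`.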

Another property of differential privacy important for our applications is preservation of its guarantee
under post-processing (e.g., [DR14, Prop. 2.1]):

Lemma 5. If $\mathcal{A}$ is an $(\varepsilon, \delta)$-differentially private algorithm with domain $\mathcal{X}^n$ and range $\mathcal{Y}$, and $\mathcal{B}$ is any,
possibly randomized, algorithm with domain $\mathcal{Y}$ and range $\mathcal{Y}'$, then the algorithm $\mathcal{B} \circ \mathcal{A}$ with domain $\mathcal{X}^n$ and
range $\mathcal{Y}'$ is also $(\varepsilon, \delta)$-differentially private.

2.2 Generalization via Differential Privacy 

Generalization in special cases of our general adaptive analysis setting can be obtained directly from results in
[DFH+14] and composition properties of differentially private algorithms. For the case of pure differentially
private algorithms with general outputs over i.i.d. datasets, in [DFH+14] we prove the following result.

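Theorem 6. Let $\mathcal{A}$ be an $\varepsilon$-differentially private algorithm with range $\mathcal{Y}$ and let $S$ be a random variable
drawn from a distribution $\mathcal{P}^n$ over $\mathcal{X}^n$. Let $Y = \mathcal{A}(S)$ be the corresponding output distribution. Assume that
for each element $y \in \mathcal{Y}$ there is a subset $R(y) \subseteq \mathcal{X}^n$ so that $\max_{y \in \mathcal{Y}} \mathbb{P}[S \in R(y)] \le \beta$. Then, for $\varepsilon \le \sqrt{\frac{\ln(1/\beta)}{2n}}$
we have $\mathbb{P}[S \in R(Y)] \le 3\sqrt{\beta}$.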

An immediate corollary of the theorem above together with McDiarmid’s inequality is that differentially private algorithms that
output low-sensitivity functions generalize.

Corollary 7. Let $\mathcal{A}$ be an algorithm that outputs a $c$-sensitive function $f : \mathcal{X}^n \to \mathbb{R}$. Let $S$ be a random
dataset chosen according to distribution $\mathcal{P}^n$ over $\mathcal{X}^n$ and let $f = \mathcal{A}(S)$. If $\mathcal{A}$ is $\tau/(cn)$-differentially private
then $\mathbb{P}[f(S) - \mathcal{P}^n[f] \ge \tau] \le 3\exp(-\tau^2/(c^2 n))$.
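The step from the theorem to the corollary can be spelled out as follows. This is a sketch under the assumption that the theorem requires $\varepsilon \le \sqrt{\ln(1/\beta)/(2n)}$ and concludes with failure probability $3\sqrt{\beta}$; for each possible output $f$, the bad set is taken to be $R(f) = \{S : f(S) - \mathcal{P}^n[f] \ge \tau\}$, whose probability for a fixed $f$ is bounded by McDiarmid's inequality.

```latex
\begin{align*}
\beta &= \exp\!\left(\frac{-2\tau^2}{n c^2}\right)
  && \text{(McDiarmid with } \alpha = \tau\text{)} \\
\sqrt{\frac{\ln(1/\beta)}{2n}}
  &= \sqrt{\frac{2\tau^2}{n c^2} \cdot \frac{1}{2n}} = \frac{\tau}{cn}
  && \text{(largest admissible } \varepsilon\text{)} \\
3\sqrt{\beta} &= 3\exp\!\left(\frac{-\tau^2}{c^2 n}\right)
  && \text{(bound for the adaptively chosen } f\text{)}
\end{align*}
```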

By the basic composition theorem above, pure differential privacy composes adaptively. Therefore, if in a sequence of algorithms
$\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_m$ algorithm $\mathcal{A}_i$ is $\varepsilon_i$-differentially private for all $i \le m - 1$ then composition of the first $i - 1$
algorithms is $\varepsilon'_{i-1}$-differentially private for $\varepsilon'_{i-1} = \sum_{j=1}^{i-1} \varepsilon_j$. The generalization theorem for pure differential privacy above can be applied to preserve the
generalization guarantees of the last algorithm $\mathcal{A}_m$ (that does not need to be differentially private). For
example, assume that for every fixed setting of $\bar{y}_{m-1}$, $\mathcal{A}_m$ has the property that it outputs a hypothesis
function $h$ such that $\mathbb{P}[\mathcal{E}_S[L(h)] - \mathcal{P}[L(h)] \ge \tau] \le e^{-\tau^2 n / d}$, for some notion of dimension $d$ and a real-valued
loss function $L$. Generalization bounds of this type follow from uniform convergence arguments based on
various notions of complexity of hypothesis classes such as VC dimension, covering numbers, fat-shattering
dimension and Rademacher complexity (see [SSBD14] for examples). Note that, for different settings of $\bar{y}_{m-1}$,
different sets of hypotheses and generalization techniques might be used. We define $R(\bar{y}_{m-1})$ to be all datasets
$S$ for which $\mathcal{A}_m(S, \bar{y}_{m-1})$ outputs $h$ such that $\mathcal{E}_S[L(h)] - \mathcal{P}[L(h)] \ge \tau$. Now if $\varepsilon'_{m-1} \le \tau/\sqrt{2d}$, then even
for the hypothesis output in the adaptive execution of $\mathcal{A}_m$ on a random i.i.d. dataset $S$ (denoted by $\mathbf{h}$) we
have $\mathbb{P}[\mathcal{E}_S[L(\mathbf{h})] - \mathcal{P}[L(\mathbf{h})] \ge \tau] \le 3e^{-\tau^2 n/(2d)}$.

For approximate $(\varepsilon, \delta)$-differential privacy, strong preservation of generalization results are currently
known only for algorithms that output a function over $\mathcal{X}$ of bounded range (for simplicity we use range $[0,1]$)
[DFH+14, NS15]. The following result was proved by Nissim and Stemmer [NS15] (a weaker statement is
also given in [DFH+14, Thm. 10]).

Theorem 8. Let $\mathcal{A}$ be an $(\varepsilon, \delta)$-differentially private algorithm that outputs a function from $\mathcal{X}$ to $[0,1]$. For
a random variable $S$ distributed according to $\mathcal{P}^n$ we let $\phi = \mathcal{A}(S)$. Then for $n \ge 2\ln(8/\delta)/\varepsilon^2$,

$$\mathbb{P}\left[\,|\mathcal{P}[\phi] - \mathcal{E}_S[\phi]| \ge 13\varepsilon\,\right] \le \frac{2\delta}{\varepsilon} \ln\left(\frac{2}{\varepsilon}\right).$$

Many learning algorithms output a hypothesis function that aims to minimize some bounded loss function
$L$ as the final output. If algorithms used in all steps of the adaptive data analysis are differentially private and
the last step (that is, $\mathcal{A}_m$) outputs a hypothesis $h$, then generalization bounds for the loss of $h$ are implied
directly by the theorem above. We remark that this application is different from the example for pure differential
privacy above since there we showed preservation of generalization guarantees of an arbitrarily complex learning
algorithm $\mathcal{A}_m$ which need not be differentially private. In a later section we give an application of this theorem to
the reusable holdout problem.

2.3 Generalization via Description Length 

Let $\mathcal{A} : \mathcal{X}^n \to \mathcal{Y}$ and $\mathcal{B} : \mathcal{X}^n \times \mathcal{Y} \to \mathcal{Y}'$ be two algorithms. We now give a simple application of bounds on
the size of $\mathcal{Y}$ (or, equivalently, the description length of $\mathcal{A}$’s output) to preserving generalization of $\mathcal{B}$. Here
generalization can actually refer to any valid or desirable output of $\mathcal{B}$ for a given dataset $S$ and input
$y \in \mathcal{Y}$. Specifically we will use a set $R(y) \subseteq \mathcal{X}^n$ to denote all datasets for which the output of $\mathcal{B}$ on $y$ and $S$
is “bad” (e.g., overfits). Using a simple union bound we show that the probability (over a random choice of a
dataset) of such a bad outcome can be bounded.

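Theorem 9. Let $\mathcal{A} : \mathcal{X}^n \to \mathcal{Y}$ be an algorithm and let $\mathbf{S}$ be a random dataset over $\mathcal{X}^n$. Assume that
$R : \mathcal{Y} \to 2^{\mathcal{X}^n}$ is such that for every $y \in \mathcal{Y}$, $\mathbb{P}[\mathbf{S} \in R(y)] \le \beta$. Then $\mathbb{P}[\mathbf{S} \in R(\mathcal{A}(\mathbf{S}))] \le |\mathcal{Y}| \cdot \beta$.

Proof.

\[ \mathbb{P}[\mathbf{S} \in R(\mathcal{A}(\mathbf{S}))] \le \sum_{y \in \mathcal{Y}} \mathbb{P}[\mathbf{S} \in R(y)] \le |\mathcal{Y}| \cdot \beta. \] □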
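The union bound of Theorem 9 can be checked exactly on a toy problem. The sketch below is illustrative only: the domain (points with `d` binary features), the selection rule `A` (pick the feature with the largest empirical mean) and the "bad" set `R(y)` are assumptions chosen for the example, not taken from the paper. It enumerates every dataset, so the probabilities are exact rather than simulated.

```python
from itertools import product

def union_bound_check(n=4, d=3, threshold=0.75):
    """Exactly enumerate all datasets of n points with d uniform binary features.

    A(S) returns the index of the feature with the largest empirical mean
    (an adaptive choice); R(y) is the set of datasets on which feature y has
    empirical mean >= threshold although its true mean is 1/2.
    """
    datasets = list(product(product((0, 1), repeat=d), repeat=n))
    prob = 1.0 / len(datasets)  # uniform distribution over datasets

    def mean(S, y):
        return sum(x[y] for x in S) / n

    def A(S):
        return max(range(d), key=lambda y: mean(S, y))

    def in_R(S, y):
        return mean(S, y) >= threshold

    # beta: worst-case probability of the bad event for a FIXED output y
    beta = max(sum(prob for S in datasets if in_R(S, y)) for y in range(d))
    # probability of the bad event when y is chosen adaptively by A
    adaptive = sum(prob for S in datasets if in_R(S, A(S)))
    return beta, adaptive, d * beta

beta, adaptive, bound = union_bound_check()
print(beta, adaptive, bound)
```

In this toy setting the adaptive probability exceeds the fixed-output probability $\beta$ but stays below $|\mathcal{Y}| \cdot \beta$, as the theorem requires.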


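The case of two algorithms implies the general case since description length composes (adaptively).
Namely, let $\mathcal{A}_1, \mathcal{A}_2, \ldots$ be a sequence of algorithms such that each algorithm $\mathcal{A}_i$ outputs a value in some
range $\mathcal{Y}_i$ and takes as an input a dataset in $\mathcal{X}^n$ as well as a value in $\bar{\mathcal{Y}}_{i-1} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_{i-1}$. Then for every $i$, we can view
the execution of $\mathcal{A}_1$ through $\mathcal{A}_{i-1}$ as the first algorithm $\bar{\mathcal{A}}_{i-1}$ with an output in $\bar{\mathcal{Y}}_{i-1}$ and $\mathcal{A}_i$ as the second
algorithm. Theorem 9 implies that if for every setting of $\bar{y}_{i-1} = (y_1, \ldots, y_{i-1}) \in \bar{\mathcal{Y}}_{i-1}$, $R(\bar{y}_{i-1}) \subseteq \mathcal{X}^n$ satisfies
that $\mathbb{P}[\mathbf{S} \in R(\bar{y}_{i-1})] \le \beta_i$ then

\[ \mathbb{P}[\mathbf{S} \in R(\bar{\mathcal{A}}_{i-1}(\mathbf{S}))] \le |\bar{\mathcal{Y}}_{i-1}| \cdot \beta_i = \prod_{j=1}^{i-1} |\mathcal{Y}_j| \cdot \beta_i. \]

Later in the paper we describe a generalization of description length bounds to randomized algorithms and
show that it possesses the same properties.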











3 Max-Information 


Consider two algorithms $\mathcal{A} : \mathcal{X}^n \to \mathcal{Y}$ and $\mathcal{B} : \mathcal{X}^n \times \mathcal{Y} \to \mathcal{Y}'$ that are composed adaptively and assume that for
every fixed input $y \in \mathcal{Y}$, $\mathcal{B}$ generalizes for all but fraction $\beta$ of datasets. Here we are speaking of generalization
informally: our definitions will support any property of input $y \in \mathcal{Y}$ and dataset $S$. Intuitively, to preserve
generalization of $\mathcal{B}$ we want to make sure that the output of $\mathcal{A}$ does not reveal too much information about
the dataset $S$. We demonstrate that this intuition can be captured via a notion of max-information and its
relaxation approximate max-information.

For two random variables $X$ and $Y$ we use $X \times Y$ to denote the random variable obtained by drawing
$X$ and $Y$ independently from their probability distributions.

Definition 10. Let $X$ and $Y$ be jointly distributed random variables. The max-information between $X$ and
$Y$, denoted $I_\infty(X;Y)$, is the minimal value of $k$ such that for every $x$ in the support of $X$ and $y$ in the
support of $Y$ we have $\mathbb{P}[X = x \mid Y = y] \le 2^k \mathbb{P}[X = x]$. Alternatively, $I_\infty(X;Y) = D_\infty((X,Y) \,\|\, X \times Y)$.

The $\beta$-approximate max-information is defined as $I_\infty^\beta(X;Y) = D_\infty^\beta((X,Y) \,\|\, X \times Y)$.

It follows immediately from Bayes’ rule that for all $\beta \ge 0$, $I_\infty^\beta(X;Y) = I_\infty^\beta(Y;X)$. Further, $I_\infty(X;Y) \le k$
if and only if for all $x$ in the support of $X$, $D_\infty(Y \mid X = x \,\|\, Y) \le k$. Clearly, max-information upper
bounds the classical notion of mutual information: $I_\infty(X;Y) \ge I(X;Y)$.
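To make the definition concrete, the following sketch computes (pure) max-information and mutual information, both in bits, for a finite joint distribution. The dictionary representation of the joint pmf and the function names are assumptions made for this illustration.

```python
import math

def _marginals(joint):
    """Marginal pmfs of X and Y from a joint pmf {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def max_information(joint):
    """I_inf(X;Y): the largest log-ratio P[x,y] / (P[x] P[y]) over the support."""
    px, py = _marginals(joint)
    return max(math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def mutual_information(joint):
    """I(X;Y): the average of the same log-ratio under the joint distribution."""
    px, py = _marginals(joint)
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Two independent fair bits, and a biased bit copied exactly (Y = X).
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
copy = {(0, 0): 7 / 8, (1, 1): 1 / 8}

print(max_information(independent), mutual_information(independent))
print(max_information(copy), mutual_information(copy))
```

For the copied biased bit, the rare outcome drives the max-information well above the mutual information, which is the worst-case versus average-case gap that the definition is designed to capture.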

In our use $(X,Y)$ is going to be a joint distribution $(\mathbf{S}, \mathcal{A}(\mathbf{S}))$, where $\mathbf{S}$ is a random $n$-element dataset
and $\mathcal{A}$ is a (possibly randomized) algorithm taking a dataset as an input. If the output of an algorithm on any
distribution $\mathcal{S}$ has low approximate max-information then we say that the algorithm has low max-information.
More formally:

Definition 11. We say that an algorithm $\mathcal{A}$ has $\beta$-approximate max-information of $k$ if for every distribution
$\mathcal{S}$ over $n$-element datasets, $I_\infty^\beta(\mathbf{S}; \mathcal{A}(\mathbf{S})) \le k$, where $\mathbf{S}$ is a dataset chosen randomly according to $\mathcal{S}$. We
denote this by $I_\infty^\beta(\mathcal{A}, n) \le k$.

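An alternative way to define the (pure) max-information of an algorithm is using the maximum of the
infinity divergence between distributions on two different inputs.

Lemma 12. Let $\mathcal{A}$ be an algorithm with domain $\mathcal{X}^n$ and range $\mathcal{Y}$. Then $I_\infty(\mathcal{A}, n) = \max_{S, S' \in \mathcal{X}^n} D_\infty(\mathcal{A}(S) \,\|\, \mathcal{A}(S'))$.

Proof. For the first direction let $k = \max_{S, S' \in \mathcal{X}^n} D_\infty(\mathcal{A}(S) \,\|\, \mathcal{A}(S'))$. Let $\mathbf{S}$ be any random variable over
$n$-element input datasets for $\mathcal{A}$ and let $Y$ be the corresponding output distribution $Y = \mathcal{A}(\mathbf{S})$. We will argue
that $I_\infty(Y; \mathbf{S}) \le k$; that $I_\infty(\mathbf{S}; Y) \le k$ then follows immediately from Bayes’ rule. For every $y \in \mathcal{Y}$, there
must exist a dataset $S_y$ such that $\mathbb{P}[Y = y \mid \mathbf{S} = S_y] \le \mathbb{P}[Y = y]$. Now, by our assumption, for every $S$,
$\mathbb{P}[Y = y \mid \mathbf{S} = S] \le 2^k \cdot \mathbb{P}[Y = y \mid \mathbf{S} = S_y]$. We can conclude that for every $S$ and every $y$, it holds that
$\mathbb{P}[Y = y \mid \mathbf{S} = S] \le 2^k \mathbb{P}[Y = y]$. This yields $I_\infty(Y; \mathbf{S}) \le k$.

For the other direction let $k = I_\infty(\mathcal{A}, n)$, let $S, S' \in \mathcal{X}^n$ and $y \in \mathcal{Y}$. For $\alpha \in (0,1)$, let $\mathbf{S}$ be the random
variable equal to $S$ with probability $\alpha$ and to $S'$ with probability $1 - \alpha$ and let $Y = \mathcal{A}(\mathbf{S})$. By our assumption,
$I_\infty(Y; \mathbf{S}) = I_\infty(\mathbf{S}; Y) \le k$. This gives

\[ \mathbb{P}[Y = y \mid \mathbf{S} = S] \le 2^k \mathbb{P}[Y = y] = 2^k \left( \alpha \mathbb{P}[Y = y \mid \mathbf{S} = S] + (1 - \alpha) \mathbb{P}[Y = y \mid \mathbf{S} = S'] \right) \]

and implies

\[ \mathbb{P}[Y = y \mid \mathbf{S} = S] \le \frac{2^k (1 - \alpha)}{1 - 2^k \alpha} \cdot \mathbb{P}[Y = y \mid \mathbf{S} = S']. \]

This holds for every sufficiently small $\alpha > 0$ and therefore

\[ \mathbb{P}[Y = y \mid \mathbf{S} = S] \le 2^k \cdot \mathbb{P}[Y = y \mid \mathbf{S} = S']. \]

Using this inequality for every $y \in \mathcal{Y}$ we obtain $D_\infty(\mathcal{A}(S) \,\|\, \mathcal{A}(S')) \le k$. □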
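Generalization via max-information: An immediate corollary of our definition of approximate max-information
is that it controls the probability of “bad events” that can happen as a result of the dependence
of $\mathcal{A}(\mathbf{S})$ on $\mathbf{S}$.

Theorem 13. Let $\mathbf{S}$ be a random dataset in $\mathcal{X}^n$ and $\mathcal{A}$ be an algorithm with range $\mathcal{Y}$ such that for some
$\beta \ge 0$, $I_\infty^\beta(\mathbf{S}; \mathcal{A}(\mathbf{S})) = k$. Then for any event $O \subseteq \mathcal{X}^n \times \mathcal{Y}$,

\[ \mathbb{P}[(\mathbf{S}, \mathcal{A}(\mathbf{S})) \in O] \le 2^k \cdot \mathbb{P}[\mathbf{S} \times \mathcal{A}(\mathbf{S}) \in O] + \beta. \]

In particular, $\mathbb{P}[(\mathbf{S}, \mathcal{A}(\mathbf{S})) \in O] \le 2^k \cdot \max_{y \in \mathcal{Y}} \mathbb{P}[(\mathbf{S}, y) \in O] + \beta$.

We remark that mutual information between $\mathbf{S}$ and $\mathcal{A}(\mathbf{S})$ would not suffice for ensuring that bad events
happen with tiny probability. For example mutual information of $k$ allows $\mathbb{P}[(\mathbf{S}, \mathcal{A}(\mathbf{S})) \in O]$ to be as high as
$k/(2\log(1/\delta))$, where $\delta = \mathbb{P}[\mathbf{S} \times \mathcal{A}(\mathbf{S}) \in O]$.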

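Composition of max-information: Approximate max-information satisfies the following adaptive composition
property:

Lemma 14. Let $\mathcal{A} : \mathcal{X}^n \to \mathcal{Y}$ be an algorithm such that $I_\infty^{\beta_1}(\mathcal{A}, n) \le k_1$, and let $\mathcal{B} : \mathcal{X}^n \times \mathcal{Y} \to \mathcal{Z}$ be an
algorithm such that for every $y \in \mathcal{Y}$, $\mathcal{B}(\cdot, y)$ has $\beta_2$-approximate max-information $k_2$. Let $\mathcal{C} : \mathcal{X}^n \to \mathcal{Z}$ be
defined such that $\mathcal{C}(S) = \mathcal{B}(S, \mathcal{A}(S))$. Then $I_\infty^{\beta_1 + \beta_2}(\mathcal{C}, n) \le k_1 + k_2$.

Proof. Let $\mathcal{D}$ be a distribution over $\mathcal{X}^n$ and $\mathbf{S}$ be a random dataset sampled from $\mathcal{D}$. By hypothesis,
$I_\infty^{\beta_1}(\mathbf{S}; \mathcal{A}(\mathbf{S})) \le k_1$. Expanding out the definition for all $O \subseteq \mathcal{X}^n \times \mathcal{Y}$:

\[ \mathbb{P}[(\mathbf{S}, \mathcal{A}(\mathbf{S})) \in O] \le 2^{k_1} \cdot \mathbb{P}[\mathbf{S} \times \mathcal{A}(\mathbf{S}) \in O] + \beta_1. \]

We also have for all $Q \subseteq \mathcal{X}^n \times \mathcal{Z}$ and for all $y \in \mathcal{Y}$:

\[ \mathbb{P}[(\mathbf{S}, \mathcal{B}(\mathbf{S}, y)) \in Q] \le 2^{k_2} \cdot \mathbb{P}[\mathbf{S} \times \mathcal{B}(\mathbf{S}, y) \in Q] + \beta_2. \]

For every $O \subseteq \mathcal{X}^n \times \mathcal{Y}$, define

\[ \mu(O) = \left( \mathbb{P}[(\mathbf{S}, \mathcal{A}(\mathbf{S})) \in O] - 2^{k_1} \cdot \mathbb{P}[\mathbf{S} \times \mathcal{A}(\mathbf{S}) \in O] \right)_+ . \]

Observe that $\mu(O) \le \beta_1$ for all $O \subseteq \mathcal{X}^n \times \mathcal{Y}$. For any event $Q \subseteq \mathcal{X}^n \times \mathcal{Z}$, we have:

\[
\begin{aligned}
\mathbb{P}[(\mathbf{S}, \mathcal{C}(\mathbf{S})) \in Q]
&= \mathbb{P}[(\mathbf{S}, \mathcal{B}(\mathbf{S}, \mathcal{A}(\mathbf{S}))) \in Q] \\
&= \sum_{S \in \mathcal{X}^n, y \in \mathcal{Y}} \mathbb{P}[(S, \mathcal{B}(S, y)) \in Q] \cdot \mathbb{P}[\mathbf{S} = S, \mathcal{A}(S) = y] \\
&\le \sum_{S \in \mathcal{X}^n, y \in \mathcal{Y}} \min\left( 2^{k_2} \cdot \mathbb{P}[\mathbf{S} \times \mathcal{B}(\mathbf{S}, y) \in Q] + \beta_2,\ 1 \right) \cdot \mathbb{P}[\mathbf{S} = S, \mathcal{A}(S) = y] \\
&\le \sum_{S \in \mathcal{X}^n, y \in \mathcal{Y}} \left( \min\left( 2^{k_2} \cdot \mathbb{P}[\mathbf{S} \times \mathcal{B}(\mathbf{S}, y) \in Q],\ 1 \right) + \beta_2 \right) \cdot \mathbb{P}[\mathbf{S} = S, \mathcal{A}(S) = y] \\
&\le \sum_{S \in \mathcal{X}^n, y \in \mathcal{Y}} \min\left( 2^{k_2} \cdot \mathbb{P}[\mathbf{S} \times \mathcal{B}(\mathbf{S}, y) \in Q],\ 1 \right) \cdot \mathbb{P}[\mathbf{S} = S, \mathcal{A}(S) = y] + \beta_2 \\
&\le \sum_{S \in \mathcal{X}^n, y \in \mathcal{Y}} \min\left( 2^{k_2} \cdot \mathbb{P}[\mathbf{S} \times \mathcal{B}(\mathbf{S}, y) \in Q],\ 1 \right) \cdot \left( 2^{k_1} \cdot \mathbb{P}[\mathbf{S} = S] \cdot \mathbb{P}[\mathcal{A}(\mathbf{S}) = y] + \mu(S, y) \right) + \beta_2 \\
&\le \sum_{S \in \mathcal{X}^n, y \in \mathcal{Y}} \min\left( 2^{k_2} \cdot \mathbb{P}[\mathbf{S} \times \mathcal{B}(\mathbf{S}, y) \in Q],\ 1 \right) \cdot 2^{k_1} \cdot \mathbb{P}[\mathbf{S} = S] \cdot \mathbb{P}[\mathcal{A}(\mathbf{S}) = y] + \sum_{S \in \mathcal{X}^n, y \in \mathcal{Y}} \mu(S, y) + \beta_2 \\
&\le \sum_{S \in \mathcal{X}^n, y \in \mathcal{Y}} \min\left( 2^{k_2} \cdot \mathbb{P}[\mathbf{S} \times \mathcal{B}(\mathbf{S}, y) \in Q],\ 1 \right) \cdot 2^{k_1} \cdot \mathbb{P}[\mathbf{S} = S] \cdot \mathbb{P}[\mathcal{A}(\mathbf{S}) = y] + \beta_1 + \beta_2 \\
&\le 2^{k_1 + k_2} \cdot \left( \sum_{S \in \mathcal{X}^n, y \in \mathcal{Y}} \mathbb{P}[\mathbf{S} \times \mathcal{B}(\mathbf{S}, y) \in Q] \cdot \mathbb{P}[\mathbf{S} = S] \cdot \mathbb{P}[\mathcal{A}(\mathbf{S}) = y] \right) + \beta_1 + \beta_2 \\
&= 2^{k_1 + k_2} \cdot \mathbb{P}[\mathbf{S} \times \mathcal{B}(\mathbf{S}, \mathcal{A}(\mathbf{S})) \in Q] + (\beta_1 + \beta_2).
\end{aligned}
\]

Applying the definition of max-information, we see that equivalently, $I_\infty^{\beta_1 + \beta_2}(\mathbf{S}; \mathcal{C}(\mathbf{S})) \le k_1 + k_2$, which is
what we wanted. □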

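This lemma can be iteratively applied, which immediately yields the following adaptive composition
theorem for max-information:

Theorem 15. Consider an arbitrary sequence of algorithms $\mathcal{A}_1, \ldots, \mathcal{A}_k$ with ranges $\mathcal{Y}_1, \ldots, \mathcal{Y}_k$ such that for
all $i$, $\mathcal{A}_i : \mathcal{X}^n \times \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_{i-1} \to \mathcal{Y}_i$ is such that $\mathcal{A}_i(\cdot, y_1, \ldots, y_{i-1})$ has $\beta_i$-approximate max-information $k_i$
for all choices of $(y_1, \ldots, y_{i-1}) \in \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_{i-1}$. Let the algorithm $\mathcal{B} : \mathcal{X}^n \to \mathcal{Y}_k$ be defined as follows:

$\mathcal{B}(S)$:

1. Let $y_1 = \mathcal{A}_1(S)$.

2. For $i = 2$ to $k$: Let $y_i = \mathcal{A}_i(S, y_1, \ldots, y_{i-1})$.

3. Output $y_k$.

Then $\mathcal{B}$ has $(\sum_i \beta_i)$-approximate max-information $(\sum_i k_i)$.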
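Theorems 13 and 15 together suggest a simple bookkeeping scheme: add up the $(k_i, \beta_i)$ parameters of the steps, then inflate the probability of a bad event computed under independence. The helper names below are hypothetical and the numbers are arbitrary; this is a sketch of the arithmetic, not a prescribed interface.

```python
def compose(steps):
    """Adaptive composition (Theorem 15): steps is a list of (k_i, beta_i) pairs."""
    ks, betas = zip(*steps)
    return sum(ks), sum(betas)

def bad_event_bound(k, beta, p_independent):
    """Theorem 13: bound P[(S, A(S)) in O] given P[S x A(S) in O] = p_independent."""
    return min(1.0, 2.0 ** k * p_independent + beta)

# Two adaptive steps with max-information 1 bit and 0.5 bits respectively.
k_total, beta_total = compose([(1.0, 0.01), (0.5, 0.02)])
print(k_total, beta_total, bad_event_bound(k_total, beta_total, 1e-3))
```

The bound is only useful when the independent-case probability is small compared with $2^{-k}$, which is why the later sections focus on keeping $k$ small.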

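Post-processing of Max-information: Another useful property that (approximate) max-information
shares with differential privacy is preservation under post-processing. The simple proof of this lemma is
identical to that of the corresponding post-processing lemma for differential privacy and hence is omitted.

Lemma 16. If $\mathcal{A}$ is an algorithm with domain $\mathcal{X}^n$ and range $\mathcal{Y}$, and $\mathcal{B}$ is any, possibly randomized, algorithm
with domain $\mathcal{Y}$ and range $\mathcal{Y}'$, then the algorithm $\mathcal{B} \circ \mathcal{A}$ with domain $\mathcal{X}^n$ and range $\mathcal{Y}'$ satisfies: for every
random variable $\mathbf{S}$ over $\mathcal{X}^n$ and every $\beta \ge 0$, $I_\infty^\beta(\mathbf{S}; \mathcal{B} \circ \mathcal{A}(\mathbf{S})) \le I_\infty^\beta(\mathbf{S}; \mathcal{A}(\mathbf{S}))$.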




3.1 Bounds on Max-information 


We now show that the basic approaches based on description length and (pure) differential privacy are 
captured by approximate max-information. 


3.1.1 Description Length 

Description length k gives the following bound on max-information. 

Theorem 17. Let $\mathcal{A}$ be a randomized algorithm taking as an input an $n$-element dataset and outputting a
value in a finite set $\mathcal{Y}$. Then for every $\beta > 0$, $I_\infty^\beta(\mathcal{A}, n) \le \log(|\mathcal{Y}|/\beta)$.

We will use the following simple property of approximate divergence (e.g. [DR14]) in the proof. For a
random variable $X$ over $\mathcal{X}$ we denote by $p(X)$ the probability distribution associated with $X$.

Lemma 18. Let $X$ and $Y$ be two random variables over the same domain $\mathcal{X}$. If

\[ \mathbb{P}_{x \sim p(X)}\left[ \frac{\mathbb{P}[X = x]}{\mathbb{P}[Y = x]} \ge 2^k \right] \le \beta \]

then $D_\infty^\beta(X \,\|\, Y) \le k$.

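Proof of Thm. 17. Let $\mathbf{S}$ be any random variable over $n$-element input datasets and let $Y$ be the corresponding
output distribution $Y = \mathcal{A}(\mathbf{S})$. We prove that for every $\beta > 0$, $I_\infty^\beta(\mathbf{S}; Y) \le \log(|\mathcal{Y}|/\beta)$.

For $y \in \mathcal{Y}$ we say that $y$ is “bad” if there exists $S$ in the support of $\mathbf{S}$ such that

\[ \frac{\mathbb{P}[Y = y \mid \mathbf{S} = S]}{\mathbb{P}[Y = y]} \ge |\mathcal{Y}|/\beta. \]

Let $B$ denote the set of all “bad” $y$’s. From this definition we obtain that for a “bad” $y$, $\mathbb{P}[Y = y] \le \beta/|\mathcal{Y}|$
and therefore $\mathbb{P}[Y \in B] \le \beta$. Let $\mathcal{B} = \mathcal{X}^n \times B$. Then

\[ \mathbb{P}[(\mathbf{S}, Y) \in \mathcal{B}] = \mathbb{P}[Y \in B] \le \beta. \]

For every $(S, y) \notin \mathcal{B}$ we have that

\[ \mathbb{P}[\mathbf{S} = S, Y = y] = \mathbb{P}[Y = y \mid \mathbf{S} = S] \cdot \mathbb{P}[\mathbf{S} = S] \le \frac{|\mathcal{Y}|}{\beta} \cdot \mathbb{P}[Y = y] \cdot \mathbb{P}[\mathbf{S} = S], \]

and hence

\[ \mathbb{P}_{(S,y) \sim p(\mathbf{S}, Y)}\left[ \frac{\mathbb{P}[\mathbf{S} = S, Y = y]}{\mathbb{P}[\mathbf{S} = S] \cdot \mathbb{P}[Y = y]} \ge \frac{|\mathcal{Y}|}{\beta} \right] \le \beta. \]

This, by Lemma 18, gives that $I_\infty^\beta(\mathbf{S}; Y) \le \log(|\mathcal{Y}|/\beta)$. □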


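We note that Thms. 13 and 17 give a slightly weaker form of Thm. 9. Defining the event $O = \{(S, y) \mid S \in R(y)\}$,
the assumptions of Thm. 9 imply that $\mathbb{P}[\mathbf{S} \times \mathcal{A}(\mathbf{S}) \in O] \le \beta$. For $\beta' = \sqrt{|\mathcal{Y}| \beta}$, by Thm. 17 we have that
$I_\infty^{\beta'}(\mathbf{S}; \mathcal{A}(\mathbf{S})) \le \log(|\mathcal{Y}|/\beta')$. Now applying Thm. 13 gives that

\[ \mathbb{P}[\mathbf{S} \in R(\mathcal{A}(\mathbf{S}))] \le \frac{|\mathcal{Y}|}{\beta'} \cdot \beta + \beta' = 2\sqrt{|\mathcal{Y}| \beta}. \]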
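The gap between the direct union bound of Thm. 9 and the route through max-information can be seen numerically. This sketch follows the derivation above; the function names and the example values of $|\mathcal{Y}|$ and $\beta$ are arbitrary choices for illustration.

```python
import math

def direct_bound(size_y, beta):
    """Union bound of Thm. 9: |Y| * beta."""
    return size_y * beta

def via_max_information(size_y, beta):
    """Thm. 17 with beta' = sqrt(|Y| beta), followed by Thm. 13."""
    beta_prime = math.sqrt(size_y * beta)
    k = math.log2(size_y / beta_prime)      # max-information in bits
    return 2.0 ** k * beta + beta_prime     # equals 2 * sqrt(|Y| * beta)

size_y, beta = 1024, 1e-9
print(direct_bound(size_y, beta), via_max_information(size_y, beta))
```

The second value is roughly the square root of the first, which is the "slightly weaker form" referred to in the text.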

Later in the paper we introduce a closely related notion of randomized description length and show that it also
provides an upper bound on approximate max-information. More interestingly, for this notion a form of the
reverse bound can be proved: a bound on the (approximate) max-information of an algorithm $\mathcal{A}$ implies a bound
on the randomized description length of the output of a different algorithm whose output is statistically
indistinguishable from that of $\mathcal{A}$.
3.1.2 Differential Privacy


We now show that pure differential privacy implies a bound on max-information. We start with a simple
bound on max-information of differentially private algorithms that applies to all distributions over datasets.
In particular, it implies that the differential privacy-based approach can be used beyond the i.i.d. setting in
[DFH+14].

Theorem 19. Let $\mathcal{A}$ be an $\varepsilon$-differentially private algorithm. Then $I_\infty(\mathcal{A}, n) \le \log e \cdot \varepsilon n$.


Proof. Clearly, any two datasets $S$ and $S'$ differ in at most $n$ elements. Therefore, for every $y$ we have
$\mathbb{P}[Y = y \mid \mathbf{S} = S] \le e^{\varepsilon n} \mathbb{P}[Y = y \mid \mathbf{S} = S']$ (this is a direct implication of the definition of differential privacy, referred to as group
privacy [DR14]), or equivalently, $D_\infty(\mathcal{A}(S) \,\|\, \mathcal{A}(S')) \le \log e \cdot \varepsilon n$. By Lemma 12 we obtain the claim. □
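Theorem 19 can be verified by brute force on a tiny instance using the characterization in Lemma 12. The sketch below assumes a specific mechanism, bitwise randomized response on datasets of $n$ bits, which is a standard $\varepsilon$-differentially private algorithm but is not discussed in the paper; it is used here only because its output distribution can be enumerated exactly.

```python
import math
from itertools import product

def rr_output_dist(dataset, eps):
    """Output pmf of bitwise randomized response: keep each bit w.p. e^eps / (1 + e^eps)."""
    keep = math.exp(eps) / (1 + math.exp(eps))
    dist = {}
    for out in product((0, 1), repeat=len(dataset)):
        p = 1.0
        for bit, o in zip(dataset, out):
            p *= keep if bit == o else 1 - keep
        dist[out] = p
    return dist

def max_divergence_bits(p, q):
    """D_inf(p || q) in bits for two pmfs with the same full support."""
    return max(math.log2(p[y] / q[y]) for y in p)

def algorithm_max_information(n, eps):
    """Lemma 12: maximize the max-divergence over all pairs of input datasets."""
    datasets = list(product((0, 1), repeat=n))
    dists = {S: rr_output_dist(S, eps) for S in datasets}
    return max(max_divergence_bits(dists[S], dists[T])
               for S in datasets for T in datasets)

n, eps = 3, 0.5
print(algorithm_max_information(n, eps), math.log2(math.e) * eps * n)
```

For this mechanism the bound is attained with equality, by two datasets that differ in every position.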


Finally, we prove a stronger bound on approximate max-information for datasets consisting of i.i.d. samples
using the technique from [DFH+14]. This bound, together with Thm. 13, generalizes the earlier generalization
theorem for pure differential privacy.


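Theorem 20. Let $\mathcal{A}$ be an $\varepsilon$-differentially private algorithm with range $\mathcal{Y}$. For a distribution $\mathcal{P}$ over $\mathcal{X}$, let
$\mathbf{S}$ be a random variable drawn from $\mathcal{P}^n$. Let $Y = \mathcal{A}(\mathbf{S})$ denote the random variable output by $\mathcal{A}$ on input $\mathbf{S}$.
Then for any $\beta > 0$, $I_\infty^\beta(\mathbf{S}; \mathcal{A}(\mathbf{S})) \le \log e \left( \varepsilon^2 n/2 + \varepsilon\sqrt{n \ln(2/\beta)/2} \right)$.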


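Proof. Fix $y \in \mathcal{Y}$. We first observe that by Jensen’s inequality,

\[ \mathbb{E}_{S \sim \mathcal{P}^n}\left[ \ln(\mathbb{P}[Y = y \mid \mathbf{S} = S]) \right] \le \ln\left( \mathbb{E}_{S \sim \mathcal{P}^n}\left[ \mathbb{P}[Y = y \mid \mathbf{S} = S] \right] \right) = \ln(\mathbb{P}[Y = y]). \]

Further, by definition of differential privacy, for two databases $S, S'$ that differ in a single element,

\[ \mathbb{P}[Y = y \mid \mathbf{S} = S] \le e^{\varepsilon} \cdot \mathbb{P}[Y = y \mid \mathbf{S} = S']. \]

Now consider the function $g(S) = \ln\left( \frac{\mathbb{P}[Y = y \mid \mathbf{S} = S]}{\mathbb{P}[Y = y]} \right)$. By the properties above we have that
$\mathbb{E}[g(\mathbf{S})] \le \ln(\mathbb{P}[Y = y]) - \ln(\mathbb{P}[Y = y]) = 0$ and $|g(S) - g(S')| \le \varepsilon$. This, by McDiarmid’s inequality, implies
that for any $t > 0$,

\[ \mathbb{P}[g(\mathbf{S}) \ge \varepsilon t] \le e^{-2t^2/n}. \tag{1} \]

For an integer $i \ge 1$, let $t_i = \varepsilon^2 n/2 + \varepsilon\sqrt{n \ln(2^i/\beta)/2}$ and let

\[ B_i = \{ S \mid t_i < g(S) \le t_{i+1} \}. \]

Let

\[ B_y = \{ S \mid g(S) > t_1 \} = \bigcup_{i \ge 1} B_i. \]

By inequality (1), we have that for $i \ge 1$,

\[ \mathbb{P}[g(\mathbf{S}) > t_i] \le \exp\left( -2 \left( \varepsilon\sqrt{n}/2 + \sqrt{\ln(2^i/\beta)/2} \right)^2 \right). \]

By Bayes’ rule, for every $S \in B_i$,

\[ \frac{\mathbb{P}[\mathbf{S} = S \mid Y = y]}{\mathbb{P}[\mathbf{S} = S]} = \frac{\mathbb{P}[Y = y \mid \mathbf{S} = S]}{\mathbb{P}[Y = y]} = \exp(g(S)) \le \exp(t_{i+1}). \]

Therefore,

\[
\begin{aligned}
\mathbb{P}[\mathbf{S} \in B_i \mid Y = y] &= \sum_{S \in B_i} \mathbb{P}[\mathbf{S} = S \mid Y = y] \\
&\le \exp(t_{i+1}) \cdot \sum_{S \in B_i} \mathbb{P}[\mathbf{S} = S] \\
&\le \exp(t_{i+1}) \cdot \mathbb{P}[g(\mathbf{S}) \ge t_i] \\
&\le \exp\left( \varepsilon^2 n/2 + \varepsilon\sqrt{n \ln(2^{i+1}/\beta)/2} - 2 \left( \varepsilon\sqrt{n}/2 + \sqrt{\ln(2^i/\beta)/2} \right)^2 \right) \\
&\le \exp\left( \varepsilon\sqrt{n/2} \left( \sqrt{\ln(2^{i+1}/\beta)} - 2\sqrt{\ln(2^i/\beta)} \right) - \ln(2^i/\beta) \right) \\
&< \exp(-\ln(2^i/\beta)) = \beta/2^i.
\end{aligned}
\]

An immediate implication of this is that

\[ \mathbb{P}[\mathbf{S} \in B_y \mid Y = y] = \sum_{i} \mathbb{P}[\mathbf{S} \in B_i \mid Y = y] \le \sum_{i \ge 1} \beta/2^i \le \beta. \]

Let $\mathcal{B} = \{ (S, y) \mid y \in \mathcal{Y},\ S \in B_y \}$. Then

\[ \mathbb{P}[(\mathbf{S}, Y) \in \mathcal{B}] = \mathbb{P}[\mathbf{S} \in B_Y] \le \beta. \]

For every $(S, y) \notin \mathcal{B}$ we have that

\[ \mathbb{P}[\mathbf{S} = S, Y = y] = \mathbb{P}[\mathbf{S} = S \mid Y = y] \cdot \mathbb{P}[Y = y] \le \exp(t_1) \cdot \mathbb{P}[\mathbf{S} = S] \cdot \mathbb{P}[Y = y], \]

and hence, by the bound on $\mathbb{P}[(\mathbf{S}, Y) \in \mathcal{B}]$ above, we get that

\[ \mathbb{P}_{(S,y) \sim p(\mathbf{S}, Y)}\left[ \frac{\mathbb{P}[\mathbf{S} = S, Y = y]}{\mathbb{P}[\mathbf{S} = S] \cdot \mathbb{P}[Y = y]} \ge \exp(t_1) \right] \le \beta. \]

This, by Lemma 18, gives that

\[ I_\infty^\beta(\mathbf{S}; Y) \le \log(\exp(t_1)) = \log e \left( \varepsilon^2 n/2 + \varepsilon\sqrt{n \ln(2/\beta)/2} \right). \tag{2} \]

□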
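To get a feel for how much Theorem 20 improves on Theorem 19, the two bounds can be evaluated side by side. The sketch below reports both in bits; the function names and the sample values of $\varepsilon$, $n$ and $\beta$ are illustrative assumptions.

```python
import math

LOG2_E = math.log2(math.e)

def pure_dp_bound(eps, n):
    """Theorem 19: I_inf(A, n) <= log(e) * eps * n (any distribution over datasets)."""
    return LOG2_E * eps * n

def iid_bound(eps, n, beta):
    """Theorem 20: beta-approximate max-information for i.i.d. datasets."""
    return LOG2_E * (eps ** 2 * n / 2 + eps * math.sqrt(n * math.log(2 / beta) / 2))

eps, n, beta = 0.01, 10_000, 1e-6
print(pure_dp_bound(eps, n), iid_bound(eps, n, beta))
```

With $\varepsilon$ on the order of $1/\sqrt{n}$ the i.i.d. bound stays at a few bits, whereas the distribution-free bound grows like $\sqrt{n}$.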


Applications: We give two simple examples of using the bounds on max-information obtained from
differential privacy to preserve bounds on generalization error that follow from concentration of measure
inequalities. Strong concentration of measure results are at the core of most generalization guarantees in
machine learning. Let $\mathcal{A}$ be an algorithm that outputs a function $f : \mathcal{X}^n \to \mathbb{R}$ of sensitivity $c$ and define the
“bad event” $O_\tau$ as the event that the empirical estimate of $f$ is more than $\tau$ away from the expectation of $f(\mathbf{S})$ for $\mathbf{S}$
distributed according to some distribution $\mathcal{D}$ over $\mathcal{X}^n$. Namely,

\[ O_\tau = \{ (S, f) : f(S) - \mathcal{D}[f] \ge \tau \}, \tag{3} \]

where $\mathcal{D}[f]$ denotes $\mathbb{E}_{\mathbf{S} \sim \mathcal{D}}[f(\mathbf{S})]$.

By McDiarmid’s inequality we know that, if $\mathbf{S}$ is distributed according to $\mathcal{P}^n$ then
$\sup_f \mathbb{P}[(\mathbf{S}, f) \in O_\tau] \le \exp(-2\tau^2/(c^2 n))$. The simpler bound in Thm. 19 implies the following corollary.

Corollary 21. Let $\mathcal{A}$ be an algorithm that outputs a $c$-sensitive function $f : \mathcal{X}^n \to \mathbb{R}$. Let $\mathbf{S}$ be a
random dataset chosen according to distribution $\mathcal{P}^n$ over $\mathcal{X}^n$ and let $f = \mathcal{A}(\mathbf{S})$. If for $\beta \ge 0$ and $\tau > 0$,
$I_\infty^\beta(\mathbf{S}; f) \le \log e \cdot \tau^2/(c^2 n)$, then $\mathbb{P}[f(\mathbf{S}) - \mathcal{P}^n[f] \ge \tau] \le \exp(-\tau^2/(c^2 n)) + \beta$. In particular, if $\mathcal{A}$ is $\tau^2/(c^2 n^2)$-differentially
private then $\mathbb{P}[f(\mathbf{S}) - \mathcal{P}^n[f] \ge \tau] \le \exp(-\tau^2/(c^2 n))$.
Note that for $f(S) = \mathcal{E}_S[\phi]$, where $\phi : \mathcal{X} \to [0,1]$, this result requires $\varepsilon = \tau^2$. The stronger bound allows us to
preserve concentration of measure even when $\varepsilon = \tau/(cn)$, which corresponds to $\tau = \varepsilon$ when $f(S) = \mathcal{E}_S[\phi]$.

Corollary 22. Let $\mathcal{A}$ be an algorithm that outputs a $c$-sensitive function $f : \mathcal{X}^n \to \mathbb{R}$. Let $\mathbf{S}$ be a random
dataset chosen according to distribution $\mathcal{P}^n$ over $\mathcal{X}^n$ and let $f = \mathcal{A}(\mathbf{S})$. If $\mathcal{A}$ is $\tau/(cn)$-differentially private
then $\mathbb{P}[f(\mathbf{S}) - \mathcal{P}^n[f] \ge \tau] \le \exp(-3\tau^2/(4c^2 n))$.


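Proof. We apply Theorem 20 with $\beta = 2\exp(-\tau^2/(c^2 n))$ to obtain that

\[ I_\infty^\beta(\mathbf{S}; f) \le \log e \cdot \left( \varepsilon^2 n/2 + \varepsilon\sqrt{n \ln(2/\beta)/2} \right) \le \log e \cdot \left( \tau^2/(c^2 n)/2 + \tau^2/(c^2 n)/\sqrt{2} \right). \]

Applying Thm. 13 to McDiarmid’s inequality we obtain that

\[
\begin{aligned}
\mathbb{P}[f(\mathbf{S}) - \mathcal{P}^n[f] \ge \tau] &\le \exp\left( (1/2 + 1/\sqrt{2})\, \tau^2/(c^2 n) \right) \cdot \exp\left( -2\tau^2/(c^2 n) \right) + 2\exp\left( -\tau^2/(c^2 n) \right) \\
&\le \exp\left( -3\tau^2/(4c^2 n) \right),
\end{aligned}
\]

where the last inequality holds when $\tau^2/(c^2 n)$ is larger than a fixed constant.

□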
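The last step of the proof holds only once $x = \tau^2/(c^2 n)$ exceeds a fixed constant. A quick numerical check (a sketch with hypothetical helper names, not part of the original argument) locates roughly where the inequality starts to hold.

```python
import math

def cor22_lhs(x):
    """exp((1/2 + 1/sqrt(2)) x) * exp(-2x) + 2 exp(-x), with x = tau^2 / (c^2 n)."""
    return math.exp((0.5 + 1 / math.sqrt(2) - 2) * x) + 2 * math.exp(-x)

def cor22_rhs(x):
    """exp(-3x/4), the bound claimed in Corollary 22."""
    return math.exp(-0.75 * x)

for x in (5, 7, 8, 10, 50):
    print(x, cor22_lhs(x) <= cor22_rhs(x))
```

Under this check the inequality fails at $x = 5$ and $x = 7$ and holds from $x = 8$ onward among the sampled values, so the "fixed constant" is a small single-digit number.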


4 Reusable Holdout 

We describe two simple algorithms that enable validation of analyst’s queries in the adaptive setting. 

4.1 Thresholdout 

Our first algorithm Thresholdout follows the approach in [DFH+14] where differentially private algorithms are
used to answer adaptively chosen statistical queries. This approach can also be applied to any low-sensitivity
functions^1 of the dataset but for simplicity we present the results for statistical queries. Here we address an
easier problem in which the analyst’s queries only need to be answered when they overfit. Also, unlike in
[DFH+14], the analyst has full access to the training set and the holdout algorithm only prevents overfitting
to the holdout dataset. As a result, unlike in the general query answering setting, our algorithm can efficiently
validate a number of queries that is exponential in $n$ as long as a relatively small number of them overfit.

Thresholdout is given access to the training dataset $S_t$ and holdout dataset $S_h$ and a budget limit $B$. It
allows any query of the form $\phi : \mathcal{X} \to [0,1]$ and its goal is to provide an estimate of $\mathcal{P}[\phi]$. To achieve this
the algorithm gives an estimate of $\mathcal{E}_{S_h}[\phi]$ in a way that prevents overfitting of functions generated by the
analyst to the holdout set. In other words, responses of Thresholdout are designed to ensure that, with high
probability, $\mathcal{E}_{S_h}[\phi]$ is close to $\mathcal{P}[\phi]$ and hence an estimate of $\mathcal{E}_{S_h}[\phi]$ gives an estimate of the true expectation
$\mathcal{P}[\phi]$. Given a function $\phi$, Thresholdout first checks if the difference between the average value of $\phi$ on the
training set $S_t$ (or $\mathcal{E}_{S_t}[\phi]$) and the average value of $\phi$ on the holdout set $S_h$ (or $\mathcal{E}_{S_h}[\phi]$) is below a certain
threshold $T + \eta$. Here, $T$ is a fixed number such as 0.01 and $\eta$ is a Laplace noise variable whose standard
deviation needs to be chosen depending on the desired guarantees. (The Laplace distribution is a symmetric
exponential distribution.) If the difference is below the threshold, then the algorithm returns $\mathcal{E}_{S_t}[\phi]$. If the
difference is above the threshold, then the algorithm returns $\mathcal{E}_{S_h}[\phi] + \xi$ for another Laplacian noise variable
$\xi$. Each time the difference is above threshold the “overfitting” budget $B$ is reduced by one. Once it is
exhausted, Thresholdout stops answering queries. In Fig. 1 we provide the pseudocode of Thresholdout.

We now establish the formal generalization guarantees that Thresholdout enjoys. As the first step we state 
what privacy parameters are achieved by Thresholdout. 

Lemma 23. Thresholdout satisfies $(2B/(\sigma n), 0)$-differential privacy. Thresholdout also satisfies
$(\sqrt{32 B \ln(2/\delta)}/(\sigma n), \delta)$-differential privacy for any $\delta > 0$.

^1 Guarantees based on pure differential privacy follow from the same analysis. Proving generalization guarantees for
low-sensitivity queries based on approximate differential privacy requires a modification of Thresholdout using techniques in
[BSSU15].
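Algorithm Thresholdout

Input: Training set $S_t$, holdout set $S_h$, threshold $T$, noise rate $\sigma$, budget $B$

1. sample $\gamma \sim \mathrm{Lap}(2 \cdot \sigma)$; $\hat{T} \leftarrow T + \gamma$

2. For each query $\phi$ do

(a) if $B < 1$ output “$\perp$”

(b) else

i. sample $\eta \sim \mathrm{Lap}(4 \cdot \sigma)$

ii. if $|\mathcal{E}_{S_h}[\phi] - \mathcal{E}_{S_t}[\phi]| > \hat{T} + \eta$

A. sample $\xi \sim \mathrm{Lap}(\sigma)$, $\gamma \sim \mathrm{Lap}(2 \cdot \sigma)$

B. $B \leftarrow B - 1$ and $\hat{T} \leftarrow T + \gamma$

C. output $\mathcal{E}_{S_h}[\phi] + \xi$

iii. else output $\mathcal{E}_{S_t}[\phi]$.

Figure 1: The details of Thresholdout algorithm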
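The pseudocode in Figure 1 translates fairly directly into code. The following is a minimal Python sketch, not a reference implementation: the class name, the `query` method, the use of `None` for the answer “⊥”, the `seed` argument and the `laplace` helper are all choices made for this illustration. Datasets are assumed to be plain lists and each query a Python callable with values in $[0,1]$.

```python
import random

def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

class Thresholdout:
    def __init__(self, train, holdout, threshold, sigma, budget, seed=None):
        self.train, self.holdout = train, holdout
        self.T, self.sigma, self.budget = threshold, sigma, budget
        self.rng = random.Random(seed)
        # Step 1: noisy threshold T_hat = T + gamma, gamma ~ Lap(2 * sigma)
        self.T_hat = threshold + laplace(2 * sigma, self.rng)

    @staticmethod
    def _mean(phi, data):
        return sum(phi(x) for x in data) / len(data)

    def query(self, phi):
        if self.budget < 1:
            return None  # the "bottom" answer: budget exhausted
        train_mean = self._mean(phi, self.train)
        holdout_mean = self._mean(phi, self.holdout)
        eta = laplace(4 * self.sigma, self.rng)
        if abs(holdout_mean - train_mean) > self.T_hat + eta:
            # Overfitting detected: answer from the holdout with fresh noise,
            # spend one unit of budget and re-randomize the threshold.
            xi = laplace(self.sigma, self.rng)
            gamma = laplace(2 * self.sigma, self.rng)
            self.budget -= 1
            self.T_hat = self.T + gamma
            return holdout_mean + xi
        return train_mean  # training estimate is consistent with the holdout
```

A typical use would construct one `Thresholdout` object per holdout set and route every validation query through `query`, so that the holdout data is never touched directly by the analyst.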


Proof. Thresholdout is an instantiation of a basic tool from differential privacy, the “Sparse Vector Algorithm”
([DR14, Algorithm 2]), together with the Laplace mechanism ([DR14, Defn. 3.3]). The sparse vector algorithm
takes as input a sequence of $c$ sensitivity $1/n$ queries^2 (here $c = B$, the budget), and for each query, attempts
to determine whether the value of the query, evaluated on the private dataset, is above a fixed threshold $T$ or
below it. In our instantiation, the holdout set is the private data set, and each function $\phi$ corresponds to
the following query evaluated on $S_h$: $f_\phi(S_h) := |\mathcal{E}_{S_h}[\phi] - \mathcal{E}_{S_t}[\phi]|$. (Note that the training set $S_t$ is viewed
as part of the definition of the query.) Thresholdout then is equivalent to the following procedure: we run
the sparse vector algorithm [DR14, Algorithm 2] with $c = B$, queries $f_\phi$ for each function $\phi$, and noise
rate $2\sigma$. Whenever an above-threshold query is reported by the sparse vector algorithm, we release its
value using the Laplace mechanism [DR14, Defn. 3.3] with noise rate $\sigma$ (this is what occurs every time
Thresholdout answers by outputting $\mathcal{E}_{S_h}[\phi] + \xi$). By the privacy guarantee of the sparse vector algorithm
([DR14, Thm. 3.25]), the sparse vector portion of Thresholdout satisfies $(B/(\sigma n), 0)$-differential privacy, and
simultaneously satisfies $(\sqrt{8 B \ln(2/\delta)}/(\sigma n), \delta/2)$-differential privacy. The Laplace mechanism portion of Thresholdout
satisfies $(B/(\sigma n), 0)$-differential privacy by [DR14, Thm. 3.6], and simultaneously satisfies $(\sqrt{8 B \ln(2/\delta)}/(\sigma n), \delta/2)$-differential
privacy by [DR14, Thm. 3.6] and [DR14, Cor. 3.21]. Finally, the composition of two mechanisms,
the first of which is $(\varepsilon_1, \delta_1)$-differentially private, and the second of which is $(\varepsilon_2, \delta_2)$-differentially private, is
itself $(\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2)$-differentially private. Adding the privacy parameters of the Sparse Vector
portion of Thresholdout and the Laplace mechanism portion of Thresholdout yields the parameters of our
lemma. □
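The two guarantees of Lemma 23 are simple closed forms, so it can help to tabulate them for candidate settings. The helper below is a sketch with a hypothetical name and signature; it returns the $(\varepsilon, \delta)$ pair for either the pure or the approximate guarantee.

```python
import math

def thresholdout_privacy(B, sigma, n, delta=None):
    """Privacy parameters of Thresholdout per Lemma 23.

    With delta=None, returns the pure guarantee (2B / (sigma n), 0);
    otherwise returns (sqrt(32 B ln(2/delta)) / (sigma n), delta).
    """
    if delta is None:
        return 2 * B / (sigma * n), 0.0
    return math.sqrt(32 * B * math.log(2 / delta)) / (sigma * n), delta

# For a large budget the approximate guarantee has a much smaller epsilon.
print(thresholdout_privacy(B=1000, sigma=0.01, n=10_000))
print(thresholdout_privacy(B=1000, sigma=0.01, n=10_000, delta=1e-6))
```

The pure guarantee scales linearly in $B$ while the approximate one scales as $\sqrt{B}$, which is the source of the two sample-size regimes discussed next.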


We note that tighter privacy parameters are possible (e.g. by invoking the parameters and guarantees of
the algorithm “NumericSparse” ([DR14, Algorithm 3]), which already combines the Laplace addition step);
we chose simpler parameters for clarity.

Note the seeming discrepancy between the guarantee provided by Thresholdout and the generalization
guarantees in the theorem and corollary for differentially private algorithms above: while those results promise
generalization bounds for functions that are generated by a differentially private algorithm, here we allow an
arbitrary data analyst to generate query functions in any way she chooses, with access to the training set and
differentially private estimates of the means of her functions on the holdout set. The connection comes from
preservation of the differential privacy guarantee under post-processing.

We can now quantify the number of samples necessary to achieve generalization error $\tau$ with probability
at least $1 - \beta$.

^2 In fact, the theorems for the Sparse Vector algorithm in Dwork and Roth are stated for sensitivity 1 queries; we use them
for sensitivity $1/n$ queries of the form $\mathcal{E}_{S_h}[\phi]$, which results in all of the bounds being scaled down by a factor of $n$.

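Lemma 24. Let $\tau, \beta, T, B > 0$. Let $S$ denote the holdout dataset of size $n$ drawn i.i.d. from a distribution $\mathcal{P}$.
Consider an algorithm that is given access to $S_t$ and adaptively chooses functions $\phi_1, \ldots, \phi_m : \mathcal{X} \to [0,1]$
while interacting with Thresholdout which is given datasets $S$, $S_t$ and parameters $\sigma, B, T$. If

\[ n \ge n_0(B, \sigma, \tau, \beta) = \max\{ 2B/(\sigma\tau),\ \ln(6/\beta)/\tau^2 \} \]

or

\[ n \ge n_1(B, \sigma, \tau, \beta) = \frac{80 \cdot \sqrt{B \ln(1/(\tau\beta))}}{\tau\sigma} \]

then for every $i \in [m]$, $\mathbb{P}[|\mathcal{P}[\phi_i] - \mathcal{E}_S[\phi_i]| \ge \tau] \le \beta$.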

Proof. Consider the first guarantee of Lemma 23. In order to achieve generalization error $\tau$ via the corollary for
pure differential privacy stated earlier (i.e. in order to guarantee that for every function $\phi$ we have
$\mathbb{P}[|\mathcal{P}[\phi] - \mathcal{E}_S[\phi]| \ge \tau] \le 6e^{-\tau^2 n}$) we need
to have $n$ large enough to achieve $(\varepsilon, 0)$-differential privacy for $\varepsilon = \tau$. To achieve this it suffices to have
$n \ge 2B/(\sigma\tau)$. By ensuring that $n \ge \ln(6/\beta)/\tau^2$ we also have that $6e^{-\tau^2 n} \le \beta$.

We can also make use of the second guarantee in Lemma 23 together with the results of Nissim and
Stemmer [NS15] (Thm. 8). In order to achieve generalization error $\tau$ with probability $1 - \beta$ (i.e. in order
to guarantee that for every function $\phi$ we have $\mathbb{P}[|\mathcal{P}[\phi] - \mathcal{E}_S[\phi]| \ge \tau] \le \beta$), we can apply Thm. 8 by setting

\[ \varepsilon = \sqrt{32 B \ln(2/\delta)}/(\sigma n) = \tau/13 \quad \text{and} \quad \delta = \frac{\beta\tau}{26 \ln(26/\tau)}. \]

We can obtain these privacy parameters from Lemma 23 by choosing any $n \ge \frac{80 \sqrt{B \ln(1/(\tau\beta))}}{\tau\sigma}$ (for sufficiently
small $\beta$ and $\tau$). We remark that a somewhat worse bound on $n_1(B, \sigma, \tau, \beta)$ follows by setting
$\varepsilon = \tau/4$ and $\delta = (\beta/8)^{4/\tau}$ in [DFH+14, Thm. 10]. □


Both settings lead to small generalization error and so we can pick whichever gives the larger bound.
The first bound grows linearly with $B$ but is simpler and can be easily extended to other distributions over
datasets and to low-sensitivity functions. The second bound has quadratically better dependence on $B$ at
the expense of a slightly worse dependence on $\tau$. We can now apply our main results to get a generalization
bound for the entire execution of Thresholdout.


Theorem 25. Let $\beta, \tau > 0$ and $m \ge B > 0$. We set $T = 3\tau/4$ and $\sigma = \tau/(96 \ln(4m/\beta))$. Let $S$ denote a
holdout dataset of size $n$ drawn i.i.d. from a distribution $\mathcal{P}$ and $S_t$ be any additional dataset over $\mathcal{X}$. Consider
an algorithm that is given access to $S_t$ and adaptively chooses functions $\phi_1, \ldots, \phi_m$ while interacting with
Thresholdout which is given datasets $S$, $S_t$ and values $\sigma, B, T$. For every $i \in [m]$, let $a_i$ denote the answer of
Thresholdout on function $\phi_i : \mathcal{X} \to [0,1]$. Further, for every $i \in [m]$, we define the counter of overfitting

\[ Z_i = \left| \{ j < i : |\mathcal{P}[\phi_j] - \mathcal{E}_{S_t}[\phi_j]| > \tau/2 \} \right|. \]

Then

\[ \mathbb{P}[\exists i \in [m],\ Z_i < B \ \&\ |a_i - \mathcal{P}[\phi_i]| \ge \tau] \le \beta \]

whenever

\[ n \ge \min\{ n_0(B, \sigma, \tau/8, \beta/(2m)),\ n_1(B, \sigma, \tau/8, \beta/(2m)) \} = O\left( \frac{\ln(m/\beta)}{\tau^2} \right) \cdot \min\left\{ B,\ \sqrt{B \ln(\ln(m/\beta)/\tau)} \right\}. \]


Proof. There are two types of error we need to control: the deviation between $a_i$ and the average value of
$\phi_i$ on the holdout set $\mathcal{E}_S[\phi_i]$, and the deviation between the average value of $\phi_i$ on the holdout set and the
expectation of $\phi_i$ on the underlying distribution, $\mathcal{P}[\phi_i]$. Specifically, we decompose the error as

\[ \mathbb{P}[a_i \ne \perp \ \&\ |a_i - \mathcal{P}[\phi_i]| \ge \tau] \le \mathbb{P}[a_i \ne \perp \ \&\ |a_i - \mathcal{E}_S[\phi_i]| \ge 7\tau/8] + \mathbb{P}[|\mathcal{P}[\phi_i] - \mathcal{E}_S[\phi_i]| \ge \tau/8]. \tag{4} \]

To control the first term we need to bound the values of noise variables used by Thresholdout. For the second
term we will use the generalization properties of Thresholdout given in Lemma 24.

We now deal with the errors introduced by the noise variables. For $i \in [m]$, let $\eta_i$, $\xi_i$ and $\gamma_i$ denote the
random variables $\eta$, $\xi$ and $\gamma$, respectively, at step $i$ of the execution of Thresholdout. We first note that each of
these variables is chosen from the Laplace distribution at most $m$ times. By properties of the Laplace distribution
with parameter $4\sigma$, we know that for every $t > 0$, $\mathbb{P}[|\eta_i| \ge t \cdot 4\sigma] = e^{-t}$. Therefore for $t = 2\ln(4m/\beta)$ we
obtain

\[ \mathbb{P}[|\eta_i| \ge 2\ln(4m/\beta) \cdot 4\sigma] \le \frac{\beta}{4m}. \]

By the definition of $\sigma$, $8\ln(4m/\beta) \cdot \sigma = \tau/12$. Applying the union bound we obtain that

\[ \mathbb{P}[\exists i,\ |\eta_i| \ge \tau/12] \le \beta/4, \]

where by $\exists i$ we refer to $\exists i \in [m]$ for brevity. Similarly, $\xi_i$ and $\gamma_i$ are obtained by sampling from the Laplace
distribution and each is re-randomized at most $B$ times. Therefore

\[ \mathbb{P}[\exists i,\ |\gamma_i| \ge \tau/24] \le B \cdot \beta/(4m) \le \beta/4 \]

and

\[ \mathbb{P}[\exists i,\ |\xi_i| \ge \tau/48] \le B \cdot \beta/(4m) \le \beta/4. \]

For answers that are different from $\perp$, we can now bound the first term of Equation (4) by considering two
cases, depending on whether Thresholdout answers query $\phi_i$ by returning $a_i = \mathcal{E}_S[\phi_i] + \xi_i$ or by returning
$a_i = \mathcal{E}_{S_t}[\phi_i]$. First, consider queries whose answers are returned using the former condition. Under this
condition, $|a_i - \mathcal{E}_S[\phi_i]| = |\xi_i|$. Next, we consider the second case, those queries whose answers are returned
using $\mathcal{E}_{S_t}[\phi_i]$. By definition of the algorithm, we have

\[ |a_i - \mathcal{E}_S[\phi_i]| = |\mathcal{E}_{S_t}[\phi_i] - \mathcal{E}_S[\phi_i]| \le T + \gamma_i + \eta_i \le 3\tau/4 + |\gamma_i| + |\eta_i|. \]

Combining these two cases implies that

\[ \mathbb{P}[\exists i,\ a_i \ne \perp \ \&\ |a_i - \mathcal{E}_S[\phi_i]| \ge 7\tau/8] \le \max\{ \mathbb{P}[\exists i,\ |\xi_i| \ge 7\tau/8],\ \mathbb{P}[\exists i,\ |\gamma_i| + |\eta_i| \ge \tau/8] \}. \]

Noting that $\tau/24 + \tau/12 = \tau/8$ and applying our bound on variables $\eta_i$, $\xi_i$ and $\gamma_i$ we get

\[ \mathbb{P}[\exists i,\ a_i \ne \perp \ \&\ |a_i - \mathcal{E}_S[\phi_i]| \ge 7\tau/8] \le \beta/2. \]

By Lemma 24, for $n \ge \min\{ n_0(B, \sigma, \tau/8, \beta/(2m)),\ n_1(B, \sigma, \tau/8, \beta/(2m)) \}$,

\[ \mathbb{P}[|\mathcal{P}[\phi_i] - \mathcal{E}_S[\phi_i]| \ge \tau/8] \le \beta/(2m). \]

Applying the union bound we obtain

\[ \mathbb{P}[\exists i,\ |\mathcal{P}[\phi_i] - \mathcal{E}_S[\phi_i]| \ge \tau/8] \le \beta/2. \]

Combining this with the bound on the noise-induced error above and using both in Equation (4) we get that

\[ \mathbb{P}[\exists i,\ a_i \ne \perp \ \&\ |a_i - \mathcal{P}[\phi_i]| \ge \tau] \le \beta. \tag{5} \]

To finish the proof we show that under the conditions on the noise variables and generalization error
used above, we have that if $Z_i < B$ then $a_i \ne \perp$. To see this, observe that for every $j < i$ that reduces
Thresholdout’s budget, we have

\[
\begin{aligned}
|\mathcal{P}[\phi_j] - \mathcal{E}_{S_t}[\phi_j]| &\ge |\mathcal{E}_S[\phi_j] - \mathcal{E}_{S_t}[\phi_j]| - |\mathcal{P}[\phi_j] - \mathcal{E}_S[\phi_j]| \\
&\ge |T + \gamma_j + \eta_j| - |\mathcal{P}[\phi_j] - \mathcal{E}_S[\phi_j]| \\
&\ge T - |\gamma_j| - |\eta_j| - |\mathcal{P}[\phi_j] - \mathcal{E}_S[\phi_j]|.
\end{aligned}
\]

This means that for every $j < i$ that reduces the budget we have $|\mathcal{P}[\phi_j] - \mathcal{E}_{S_t}[\phi_j]| > 3\tau/4 - \tau/24 - \tau/12 - \tau/8 = \tau/2$
and hence (when the conditions on the noise variables and generalization error are satisfied) for every $i$,
if $Z_i < B$ then Thresholdout’s budget is not yet exhausted and $a_i \ne \perp$. We can therefore conclude that

\[ \mathbb{P}[\exists i,\ Z_i < B \ \&\ |a_i - \mathcal{P}[\phi_i]| \ge \tau] \le \beta. \]

□
Note that in the final bound on $n$, the term $O\left( \frac{\ln(m/\beta)}{\tau^2} \right)$ is equal (up to a constant factor) to the number
of samples that are necessary to answer $m$ non-adaptively chosen queries with tolerance $\tau$ and confidence $1 - \beta$.
In particular, as in the non-adaptive setting, achievable tolerance $\tau$ scales as $1/\sqrt{n}$ (up to the logarithmic
factor). Further, this bound allows $m$ to be exponentially large in $n$ as long as $B$ grows sub-quadratically in
$n$ (that is, $B \le n^{2-c}$ for a constant $c > 0$).
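The parameter choices of Theorem 25 and the sample-size expressions of Lemma 24 can be combined into a small calculator. This is a sketch under stated assumptions: the function names are hypothetical, the expression used for `n1` is the one reconstructed from the lemma and its proof, and the constants are worst-case constants from the analysis, so the resulting values of $n$ should be read as conservative upper estimates rather than practical recommendations.

```python
import math

def n0(B, sigma, tau, beta):
    """First bound of Lemma 24 (pure differential privacy)."""
    return max(2 * B / (sigma * tau), math.log(6 / beta) / tau ** 2)

def n1(B, sigma, tau, beta):
    """Second bound of Lemma 24 (approximate differential privacy)."""
    return 80 * math.sqrt(B * math.log(1 / (tau * beta))) / (tau * sigma)

def thresholdout_parameters(tau, beta, m, B):
    """Settings of Theorem 25: threshold T, noise rate sigma, sufficient holdout size n."""
    T = 3 * tau / 4
    sigma = tau / (96 * math.log(4 * m / beta))
    n = min(n0(B, sigma, tau / 8, beta / (2 * m)),
            n1(B, sigma, tau / 8, beta / (2 * m)))
    return T, sigma, math.ceil(n)

print(thresholdout_parameters(tau=0.1, beta=0.05, m=1000, B=10))
```

Evaluating the calculator for a range of $B$ shows the crossover between the linear-in-$B$ and the $\sqrt{B}$ regimes described above.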

Remark 26. In Thm. 25, $S_t$ is used solely to provide a candidate estimate of the expectation of each query
function. The theorem holds for any other way to provide such estimates. In addition, the one-sided version
of the algorithm can be used when catching only the one-sided error is necessary. For example, in many cases
overfitting is problematic only if the training error estimate is larger than the true error. This is achieved by
using the condition $\mathcal{E}_{S_h}[\phi] - \mathcal{E}_{S_t}[\phi] > \hat{T} + \eta$ to detect overfitting. In this case only one-sided errors will be
caught by Thresholdout and only one-sided overfitting will decrease the budget.

4.2 SparseValidate 

We now present a general algorithm for validation on the holdout set that can validate many arbitrary queries 
as long as few of them fail the validation. The algorithm which we refer to as SparseValidate only reveals 
information about the holdout set when validation fails and therefore we use bounds based on description 
length to analyze its generalization guarantees. 

More formally, our algorithm allows the analyst to pick any Boolean function of a dataset $\psi$ (or even any
algorithm that outputs a single bit) and provides back the value of $\psi$ on the holdout set $\psi(S_h)$. SparseValidate
has a budget $m$ for the total number of queries that can be asked and budget $B$ for the number of queries
that returned 1. Once either of the budgets is exhausted, no additional answers are given. We now give a
general description of the guarantees of SparseValidate.

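Theorem 27. Let $\mathbf{S}$ denote a randomly chosen holdout set of size $n$. Let $\mathcal{A}$ be an algorithm that is given
access to SparseValidate$(m, B)$ and outputs queries $\psi_1, \ldots, \psi_m$ such that each $\psi_i$ is in some set $\Psi_i$ of functions
from $\mathcal{X}^n$ to $\{0,1\}$. Assume that for every $i \in [m]$ and $\psi_i \in \Psi_i$, $\mathbb{P}[\psi_i(\mathbf{S}) = 1] \le \beta_i$. Let $\boldsymbol{\psi}_i$ be the random
variable equal to the $i$’th query of $\mathcal{A}$ on $\mathbf{S}$. Then $\mathbb{P}[\boldsymbol{\psi}_i(\mathbf{S}) = 1] \le \ell_i \cdot \beta_i$, where $\ell_i = \sum_{j=0}^{\min\{i-1, B\}} \binom{i-1}{j} \le m^B$.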

Proof. Let B denote the algorithm that represents view the interaction of A with SparseValidate(TO, B) up 
until query i and outputs the all the i — 1 responses of SparseValidate(m, B) in this interaction. If there are B 
responses with value 1 in the interaction then all the responses after the last one are meaningless and can be 
assumed to be equal to 0. The number of binary strings of length i — 1 that contain at most B ones is exactly 
Q)- Therefore we can assume that the output domain of B has size ii and we denote it 
by y. Now, loT y £ y let R{y) be the set of datasets S such that ipi{S) = 1, where ipi is the function that 
A generates when the responses of SparseValidate(TO, B) are y and the input holdout dataset is S (for now 
assume that A is deterministic). By the conditions of the theorem we have that for every y, P[5 G B(y)] < Pi- 
Applying Thm. to B, we get that P)^ e i?(;B(5'))] < liPi, which is exactly the claim. We note that to 
address the case when A is randomized (including dependent on the random choice of the training set) we 
can use the argument above for every fixing of all the random bits of A. From there we obtain that the claim 
holds when the probability is taken also over the randomness of A. 

We remark that the proof can also be obtained via a more direct application of the union bound over all 
strings in y. But the proof via Thm. demonstrates the conceptual role that short description length plays 
in this application. □ 
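
The counting step in the proof is easy to check numerically. The sketch below computes the number of binary strings of length $i - 1$ with at most $B$ ones in closed form and compares it against brute-force enumeration; the function names are illustrative and not taken from the paper.

```python
from itertools import product
from math import comb


def ell(i: int, B: int) -> int:
    """Number of binary strings of length i-1 that contain at most B ones."""
    return sum(comb(i - 1, j) for j in range(min(i - 1, B) + 1))


def ell_brute_force(i: int, B: int) -> int:
    """Same quantity, obtained by enumerating all strings of length i-1."""
    return sum(1 for s in product((0, 1), repeat=i - 1) if sum(s) <= B)
```

When $B \ge i - 1$ the count is simply $2^{i-1}$; the budget only helps once the interaction is longer than $B$.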

In this general formulation it is the analyst’s responsibility to use the budgets economically and pick 
query functions that do not fail validation often. At the same time, SparseValidate ensures that (for the 
appropriate values of the parameters) the analyst can think of the holdout set as a fresh sample for the 
purposes of validation. Hence the analyst can pick queries in such a way that failing the validation reliably 
indicates overfitting. To relate this algorithm to Thresholdout, consider the validation query function that 
is the indicator of the condition $|\mathcal{E}_{S_h}[\phi] - \mathcal{E}_{S_t}[\phi]| > T + \eta$ (note that this condition can be evaluated using
an algorithm with access to $S_h$). This is precisely the condition that consumes the overfitting budget of
Thresholdout. Now, as in Thresholdout, for every fixed $\phi$, $\mathbb{P}[|\mathcal{E}_{S_h}[\phi] - \mathcal{P}[\phi]| > \tau] \le 2e^{-2\tau^2 n}$. If $B \le \tau^2 n / \ln m$,
then we obtain that for every query $\phi$ generated by the analyst, we still have strong concentration of the mean
on the holdout set around the expectation: $\mathbb{P}[|\mathcal{E}_{S_h}[\phi] - \mathcal{P}[\phi]| > \tau] \le 2e^{-\tau^2 n}$. This implies that if the condition
$|\mathcal{E}_{S_h}[\phi] - \mathcal{E}_{S_t}[\phi]| > T + \eta$ holds, then with high probability also the condition $|\mathcal{P}[\phi] - \mathcal{E}_{S_t}[\phi]| > T + \eta - \tau$ holds,
indicating overfitting. One notable distinction of Thresholdout from SparseValidate is that SparseValidate does
not provide corrections in the case of overfitting. One way to remedy that is simply to use a version of
SparseValidate that allows functions with values in $\{0, 1, \ldots, L\}$. It is easy to see that for such functions
we would obtain the bound of the form $\ell_i \cdot L^{\min\{B, i-1\}} \cdot \beta_i$. To output a value in $[0, 1]$ with precision $\tau$,
$L = \lfloor 1/\tau \rfloor$ would suffice. However, in many cases a more economical solution would be to have a separate
dataset which is used just for obtaining the correct estimates. 
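
The budget condition $B \le \tau^2 n / \ln m$ can be checked with a few lines of arithmetic. The sketch below uses the crude bound $\ell_i \le m^B$ (an assumption of this example, valid for $B \ge 1$ and $m \ge 2$) and works in log-space to avoid overflow; the function names are hypothetical.

```python
import math


def max_overfitting_budget(n: int, tau: float, m: int) -> int:
    """Largest integer B with B <= tau^2 * n / ln(m)."""
    return math.floor(tau * tau * n / math.log(m))


def adaptive_failure_bound(n: int, tau: float, m: int, B: int) -> float:
    """m^B * 2 * exp(-2 tau^2 n): the fixed-query bound inflated by m^B."""
    return 2.0 * math.exp(-2.0 * tau * tau * n + B * math.log(m))


def target_bound(n: int, tau: float) -> float:
    """The bound 2 * exp(-tau^2 n) that the budget condition is meant to guarantee."""
    return 2.0 * math.exp(-tau * tau * n)
```

For any $B$ within the budget, `adaptive_failure_bound` does not exceed `target_bound`, since $B \ln m \le \tau^2 n$ cancels at most half of the exponent.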

An example of the application of SparseValidate for answering statistical and low-sensitivity queries that is
based on our analysis can be found in [BSSU15]. The analysis of generalization on the holdout set in [BH15]
and the analysis of the Median Mechanism we give in Section B also rely on this sparsity-based technique.

An alternative view of this algorithm is as a general template for designing algorithms for answering some 
specific type of adaptively chosen queries. Generalization guarantees specific to the type of query can then be 
obtained from our general analysis. For example, an algorithm that fits a mixture of Gaussians model to the 
data could define the validation query to be an algorithm that fits the mixture model to the holdout and 
obtains a vector of parameters $\theta_h$. The validation query then compares it with the vector of parameters $\theta_t$
obtained on the training set and outputs 1 if the parameter vectors are "not close" (indicating overfitting).
Given guarantees of statistical validity of the parameter estimation method in the static setting one could
then derive guarantees for adaptive validation via Thm. 27.


5 Experiments 

We describe a simple experiment on synthetic data that illustrates the danger of reusing a standard holdout 
set and how this issue can be resolved by our reusable holdout. In our experiment the analyst wants to 
build a classifier via the following common strategy. First the analyst finds a set of single attributes that 
are correlated with the class label. Then the analyst aggregates the correlated variables into a single model 
of higher accuracy (for example using boosting or bagging methods). More formally, the analyst is given 
a $d$-dimensional labeled data set $S$ of size $2n$ and splits it randomly into a training set $S_t$ and a holdout
set $S_h$ of equal size. We denote an element of $S$ by a tuple $(x, y)$ where $x$ is a $d$-dimensional vector and
$y \in \{-1, 1\}$ is the corresponding class label. The analyst wishes to select variables to be included in her
classifier. For various values of the number of variables to select k, she picks k variables with the largest 
absolute correlations with the label. However, she verifies the correlations (with the label) on the holdout set 
and uses only those variables whose correlation agrees in sign with the correlation on the training set and 
both correlations are larger than some threshold in absolute value. She then creates a simple linear threshold 
classifier on the selected variables using only the signs of the correlations of the selected variables. A final 
test evaluates the classification accuracy of the classifier on both the training set and the holdout set. 

Formally, the algorithm is used to build a linear threshold classifier:

1. For each attribute $i \in [d]$ compute the correlation with the label on the training and holdout sets:
$w_i^t = \frac{1}{n} \sum_{(x,y) \in S_t} x_i y$ and $w_i^h = \frac{1}{n} \sum_{(x,y) \in S_h} x_i y$. Let

$W = \{ i \mid w_i^t \cdot w_i^h > 0;\ |w_i^t| \ge 1/\sqrt{n};\ |w_i^h| \ge 1/\sqrt{n} \}$

that is the set of variables for which $w_i^t$ and $w_i^h$ have the same sign and both are at least $1/\sqrt{n}$ in
absolute value (this is the standard deviation of the correlation in our setting). Let $V_k$ be the subset of
variables in $W$ with $k$ largest values of $|w_i^t|$.

2. Construct the classifier $f(x) = \mathrm{sgn}\left( \sum_{i \in V_k} \mathrm{sgn}(w_i^t) \cdot x_i \right)$.
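
The two steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code used for the experiments: the function names are made up for the example, correlations are taken to be averages of $x_i y$ over the respective set (so that $1/\sqrt{n}$ is their standard deviation for standard normal attributes), and ties in the sign function are mapped to 0.

```python
import math


def sign(v: float) -> int:
    return 1 if v > 0 else (-1 if v < 0 else 0)


def correlations(data, d):
    """Average of x_i * y over a data set of (x, y) pairs, for each attribute i."""
    n = len(data)
    return [sum(x[i] * y for x, y in data) / n for i in range(d)]


def select_variables(w_train, w_hold, n, k):
    """Step 1: keep attributes whose correlations agree in sign and exceed 1/sqrt(n)."""
    thr = 1.0 / math.sqrt(n)
    W = [i for i in range(len(w_train))
         if w_train[i] * w_hold[i] > 0
         and abs(w_train[i]) >= thr and abs(w_hold[i]) >= thr]
    # Among the surviving attributes, take the k largest training correlations.
    W.sort(key=lambda i: abs(w_train[i]), reverse=True)
    return W[:k]


def make_classifier(w_train, selected):
    """Step 2: linear threshold classifier using only the signs of the correlations."""
    def f(x):
        return sign(sum(sign(w_train[i]) * x[i] for i in selected))
    return f
```

Note that `select_variables` is exactly where the holdout set leaks into the model: the holdout correlations decide which attributes survive, so a later accuracy estimate on the same holdout set is no longer unbiased.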

In the experiments we used an implementation of Thresholdout that differs somewhat from the algorithm
we analyzed theoretically (given in Figure). First, we set the parameters to be $T = 0.04$ and $\tau = 0.01$.
This is lower than the values necessary for the proof (and which are not intended for direct application)
but suffices to prevent overfitting in our experiment. Second, we use Gaussian noise instead of Laplacian
noise as it has stronger concentration properties (in many differential privacy applications similar theoretical
guarantees hold for mechanisms based on Gaussian noise).
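
One plausible shape for such a simplified Thresholdout step is sketched below. It is an assumption-laden illustration rather than the implementation used in the experiments: it omits the overfitting budget, it uses $\tau$ as the standard deviation of the Gaussian noise, and the function name and signature are invented for the example. Only the values $T = 0.04$ and $\tau = 0.01$ come from the text.

```python
import random


def thresholdout_answer(train_mean, holdout_mean, T=0.04, tau=0.01, rng=random):
    """Answer one query given its empirical means on the training and holdout sets."""
    # Compare the train/holdout gap against a noisy threshold.
    noisy_threshold = T + rng.gauss(0.0, tau)
    if abs(train_mean - holdout_mean) < noisy_threshold:
        # The training estimate is trusted; nothing about the holdout is revealed.
        return train_mean
    # Otherwise return a noisy version of the holdout estimate.
    return holdout_mean + rng.gauss(0.0, tau)
```

With this structure, a reported value that coincides with the training value only says that the holdout value is within roughly $T$ of it, which matches the remark later in this section that reported accuracies can be off by up to 0.04.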

No correlation between labels and data: In our first experiment, each attribute is drawn independently
from the normal distribution $N(0, 1)$ and we choose the class label $y \in \{-1, 1\}$ uniformly at random so that
there is no correlation between the data point and its label. We chose $n = 10{,}000$, $d = 10{,}000$ and varied
the number of selected variables $k$. In this scenario no classifier can achieve true accuracy better than 50%.
Nevertheless, reusing a standard holdout results in reported accuracy of over 63% for $k = 500$ on both the
training set and the holdout set (the standard deviation of the error is less than 0.5%). The average and
standard deviation of results obtained from 100 independent executions of the experiment are plotted in
Figure 2, which also includes the accuracy of the classifier on another fresh data set of size $n$ drawn from
the same distribution. We then executed the same algorithm with our reusable holdout. The algorithm
Thresholdout was invoked with $T = 0.04$ and $\tau = 0.01$, which explains why the accuracy of the classifier reported
by Thresholdout is off by up to 0.04 whenever the accuracy on the holdout set is within 0.04 of the accuracy
on the training set. Thresholdout prevents the algorithm from overfitting to the holdout set and gives a valid
estimate of classifier accuracy.


Figure 2: No correlation between class labels and data points. The plot shows the classification accuracy of the
classifier on training, holdout and fresh sets. Margins indicate the standard deviation.

High correlation between labels and some of the variables: In our second experiment, the class
labels are correlated with some of the variables. As before the label is randomly chosen from $\{-1, 1\}$ and
each of the attributes is drawn from $N(0, 1)$ aside from 20 attributes which are drawn from $N(y \cdot 0.06, 1)$
where $y$ is the class label. We execute the same algorithm on this data with both the standard holdout and
Thresholdout and plot the results in Figure 3. Our experiment shows that when using the reusable holdout,
the algorithm still finds a good classifier while preventing overfitting. This illustrates that the reusable
holdout simultaneously prevents overfitting and allows for the discovery of true statistical patterns.

In Figures 2 and 3, simulations that used Thresholdout for selecting the variables also show the accuracy
on the holdout set as reported by Thresholdout. For comparison purposes, in Figure 4 we plot the actual
accuracy of the generated classifier on the holdout set (the parameters of the simulation are identical to those
used in Figures 2 and 3). It demonstrates that there is essentially no overfitting to the holdout set. Note
that the advantage of the accuracy reported by Thresholdout is that it can be used to make further data
dependent decisions while mitigating the risk of overfitting.

Figure 3: Some variables are correlated with the label.

Figure 4: Accuracy of the classifier produced with Thresholdout on the holdout set.

Discussion of the results: Overfitting to the standard holdout set arises in our experiment because the
analyst reuses the holdout after using it to measure the correlation of single attributes. We first note that
neither cross-validation nor bootstrap resolves this issue. If we used either of these methods to validate the
correlations, overfitting would still arise due to using the same data for training and validation (of the final
classifier). It is tempting to recommend other solutions to the specific problem we used in our experiment.
Indeed, a significant number of methods in the statistics and machine learning literature deal with inference
for fixed two-step procedures where the first step is variable selection (see [HTF09] for examples). Our
experiment demonstrates that even in such simple and standard settings our method avoids overfitting
without the need to use a specialized procedure - and, of course, extends more broadly. More importantly,
the reusable holdout gives the analyst a general and principled method to perform multiple validation steps
where previously the only known safe approach was to collect a fresh holdout set each time a function depends
on the outcomes of previous validations.


6 Conclusions 

In this work, we give a unifying view of two techniques (differential privacy and description length bounds) 
which preserve the generalization guarantees of subsequent algorithms in adaptively chosen sequences of data 
analyses. Although these two techniques both imply low max-information - and hence can be composed 
together while preserving their guarantees - the kinds of guarantees that can be achieved by either alone are 
incomparable. This suggests that the problem of generalization guarantees under adaptivity is ripe for future 
study on two fronts. First, the existing theory is likely already strong enough to develop practical algorithms 
with rigorous generalization guarantees, of which Thresholdout is an example. However, additional empirical
work is needed to better understand when and how the theory should be applied in specific application
scenarios. At the same time, new theory is also needed. As an example of a basic question we still do
not know the answer to: even in the simple setting of adaptively reusing a holdout set for computing the
expectations of boolean-valued predicates, is it possible to obtain stronger generalization guarantees (via any
means) than those that are known to be achievable via differential privacy?

A From Max-information to Randomized Description Length

In this section we demonstrate additional connections between max-information, differential privacy and 
description length. These connections are based on a generalization of description length to randomized 
algorithms that we refer to as randomized description length. 

Definition 28. For a universe $\mathcal{Y}$ let $\mathcal{A}$ be a randomized algorithm with input in $\mathcal{X}$ and output in $\mathcal{Y}$. We say
that the output of $\mathcal{A}$ has randomized description length $k$ if for every fixed setting of random coin flips of $\mathcal{A}$
the set of possible outputs of $\mathcal{A}$ has size at most $2^k$.

We first note that just as the (deterministic) description length, randomized description length implies
generalization and gives a bound on max-information.

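Theorem 29. Let $\mathcal{A} : \mathcal{X}^n \to \mathcal{Y}$ be an algorithm with randomized description length $k$ and let $S$ be a
random dataset over $\mathcal{X}^n$. Assume that $R : \mathcal{Y} \to 2^{\mathcal{X}^n}$ is such that for every $y \in \mathcal{Y}$, $\mathbb{P}[S \in R(y)] \le \beta$. Then
$\mathbb{P}[S \in R(\mathcal{A}(S))] \le 2^k \cdot \beta$.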

Theorem 30. Let $\mathcal{A}$ be an algorithm with randomized description length $k$ taking as an input an $n$-element
dataset and outputting a value in $\mathcal{Y}$. Then for every $\beta > 0$, $I_\infty^\beta(\mathcal{A}, n) \le \log(2^k/\beta)$.


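Proof. Let $S$ be any random variable over $n$-element input datasets and let $Y$ be the corresponding output
distribution $Y = \mathcal{A}(S)$. It suffices to prove that for every $\beta > 0$, $I_\infty^\beta(S; Y) \le k + \log(1/\beta)$.

Let $R$ be the set of all possible values of the random bits of $\mathcal{A}$ and let $\mathcal{R}$ denote the uniform distribution
over a choice of $r \in R$. For $r \in R$, let $\mathcal{A}_r$ denote $\mathcal{A}$ with the random bits set to $r$ and let $Y_r = \mathcal{A}_r(S)$. Observe
that by the definition of randomized description length, the range of $\mathcal{A}_r$ has size at most $2^k$. Therefore, by
Theorem 17 we obtain that $I_\infty^\beta(S; Y_r) \le \log(2^k/\beta)$.

For any event $O \subseteq \mathcal{X}^n \times \mathcal{Y}$ we have that

$$
\begin{aligned}
\mathbb{P}[(S, Y) \in O] &= \mathop{\mathbb{E}}_{r \sim \mathcal{R}} \left[ \mathbb{P}[(S, Y_r) \in O] \right] \\
&\le \mathop{\mathbb{E}}_{r \sim \mathcal{R}} \left[ \frac{2^k}{\beta} \cdot \mathbb{P}[S \times Y_r \in O] + \beta \right] \\
&= \frac{2^k}{\beta} \cdot \mathop{\mathbb{E}}_{r \sim \mathcal{R}} \left[ \mathbb{P}[S \times Y_r \in O] \right] + \beta \\
&= \frac{2^k}{\beta} \cdot \mathbb{P}[S \times Y \in O] + \beta.
\end{aligned}
$$

By the definition of $\beta$-approximate max-information, we obtain that $I_\infty^\beta(S; Y) \le \log(2^k/\beta)$. □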

We next show that if the output of an algorithm A has low approximate max-information about its input 
then there exists a (different) algorithm whose output is statistically close to that of A while having short 
randomized description. We remark that this reduction requires the knowledge of the marginal distribution
$\mathcal{A}(S)$.

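Lemma 31. Let $\mathcal{A}$ be a randomized algorithm taking as an input a dataset of $n$ points from $\mathcal{X}$ and outputting
a value in $\mathcal{Y}$. Let $Z$ be a random variable over $\mathcal{Y}$. For $k > 0$ and a dataset $S$ let $\beta_S = \min\{\beta \mid D_\infty^\beta(\mathcal{A}(S) \| Z) \le k\}$.
There exists an algorithm $\mathcal{A}'$ that, given $S \in \mathcal{X}^n$, $\beta$, $k$ and any $\beta' > 0$, satisfies:

1. the output of $\mathcal{A}'$ has randomized description length $k + \log\ln(1/\beta')$.

2. for every $S$, $\Delta(\mathcal{A}'(S), \mathcal{A}(S)) \le \beta_S + \beta'$.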

Proof. Let $S$ denote the input dataset. By definition of $\beta_S$, $D_\infty^{\beta_S}(\mathcal{A}(S) \| Z) \le k$. By the properties of
approximate divergence (e.g. [DR14]), $D_\infty^{\beta_S}(\mathcal{A}(S) \| Z) \le k$ implies that there exists a random variable $Y$ such
that $\Delta(\mathcal{A}(S), Y) \le \beta_S$ and $D_\infty(Y \| Z) \le k$.

For $t = 2^k \ln(1/\beta')$ the algorithm $\mathcal{A}'$ randomly and independently chooses $t$ samples from $Z$. Denote
them by $y_1, y_2, \ldots, y_t$. For $i = 1, 2, \ldots, t$, $\mathcal{A}'$ outputs $y_i$ with probability $p_i = \frac{\mathbb{P}[Y = y_i]}{2^k \cdot \mathbb{P}[Z = y_i]}$ and goes to the next
sample otherwise. Note that $p_i \in [0, 1]$ and therefore this is a legal choice of probability. When all samples
are exhausted the algorithm outputs $y_1$.

We first note that by the definition of this algorithm its output has randomized description length
$\log t = k + \log\ln(1/\beta')$. Let $T$ denote the event that at least one of the samples was accepted. Conditioned
on this event the output of $\mathcal{A}'(S)$ is distributed according to $Y$. For each $i$,

$$
\mathop{\mathbb{E}}_{y_i \sim p(Z)} [p_i] = \mathop{\mathbb{E}}_{y_i \sim p(Z)} \left[ \frac{\mathbb{P}[Y = y_i]}{2^k \cdot \mathbb{P}[Z = y_i]} \right] = 2^{-k}.
$$

This means that the probability that none of $t$ samples will be accepted is $(1 - 2^{-k})^t \le e^{-t/2^k} = \beta'$. Therefore
$\Delta(\mathcal{A}'(S), Y) \le \beta'$ and, consequently, $\Delta(\mathcal{A}'(S), \mathcal{A}(S)) \le \beta_S + \beta'$. □
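
The sampling procedure in the proof is a form of rejection sampling and can be sketched directly for finite distributions. In the sketch below the distributions of $Y$ and $Z$ are passed as dictionaries of probabilities, $t$ is rounded up to an integer, and the function name and arguments are assumptions made for illustration; the caller is responsible for ensuring $\mathbb{P}[Y = y] \le 2^k \cdot \mathbb{P}[Z = y]$ for all $y$.

```python
import math
import random


def approx_sampler(p_Y, p_Z, k, beta_prime, rng):
    """Sketch of A': draw t samples from Z, accept y_i with prob P[Y=y_i] / (2^k P[Z=y_i])."""
    t = math.ceil(2 ** k * math.log(1.0 / beta_prime))
    outcomes = list(p_Z)
    weights = [p_Z[y] for y in outcomes]
    # The t samples from Z play the role of the fixed random coins: the output is
    # always one of them, which is why the description length is about log t.
    samples = rng.choices(outcomes, weights=weights, k=t)
    for y in samples:
        p_accept = p_Y.get(y, 0.0) / (2 ** k * p_Z[y])
        if rng.random() < p_accept:
            return y
    # All samples were rejected: fall back to the first sample.
    return samples[0]
```

Conditioned on at least one acceptance the output follows $Y$ exactly; the fallback branch is the only source of error and is taken with probability at most $\beta'$.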

We can now use Lemma 31 to show that if for a certain random choice of a dataset $S$, the output of $\mathcal{A}$ has
low approximate max-information then there exists an algorithm $\mathcal{A}'$ whose output on $S$ has low randomized
description length and is statistically close to the output distribution of $\mathcal{A}$.

Theorem 32. Let $S$ be a random dataset in $\mathcal{X}^n$ and $\mathcal{A}$ be an algorithm taking as an input a dataset in
$\mathcal{X}^n$ and having a range $\mathcal{Y}$. Assume that for some $\beta > 0$, $I_\infty^\beta(S; \mathcal{A}(S)) = k$. For any $\beta' > 0$, there exists an
algorithm $\mathcal{A}'$ taking as an input a dataset in $\mathcal{X}^n$ such that

1. the output of $\mathcal{A}'$ has randomized description length $k + \log\ln(1/\beta')$;

2. $\Delta((S, \mathcal{A}'(S)), (S, \mathcal{A}(S))) \le \beta + \beta'$.

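Proof. For a fixed dataset $s \in \mathcal{X}^n$ let $\beta_s = \min\{\beta \mid D_\infty^\beta(\mathcal{A}(s) \| \mathcal{A}(S)) \le k\}$, where $S$ is the random dataset
of the theorem. To prove this result it suffices to observe that $\mathbb{E}[\beta_S] \le \beta$ and then apply Lemma 31 with
$Z = \mathcal{A}(S)$. To show that $\mathbb{E}[\beta_S] \le \beta$ let $O_s \subseteq \mathcal{Y}$ denote an
event such that $\mathbb{P}[\mathcal{A}(s) \in O_s] = 2^k \cdot \mathbb{P}[\mathcal{A}(S) \in O_s] + \beta_s$. Let $O = \bigcup_s \{s\} \times O_s$. Then,

$$
\begin{aligned}
\mathbb{P}[(S, \mathcal{A}(S)) \in O] &= \mathop{\mathbb{E}}_{s \sim p(S)} \left[ \mathbb{P}[\mathcal{A}(s) \in O_s] \right] \\
&= \mathop{\mathbb{E}}_{s \sim p(S)} \left[ 2^k \cdot \mathbb{P}[\mathcal{A}(S) \in O_s] + \beta_s \right] \\
&= 2^k \cdot \mathbb{P}[S \times \mathcal{A}(S) \in O] + \mathbb{E}[\beta_S].
\end{aligned}
$$

If $\mathbb{E}[\beta_S] > \beta$ then it would hold for some $k' > k$ that $\mathbb{P}[(S, \mathcal{A}(S)) \in O] = 2^{k'} \cdot \mathbb{P}[S \times \mathcal{A}(S) \in O] + \beta$,
contradicting the assumption $I_\infty^\beta(S; \mathcal{A}(S)) = k$. We remark that it is also easy to see that $\mathbb{E}[\beta_S] = \beta$. □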

It is important to note that Theorem 32 is not the converse of Theorem 30 and does not imply equivalence
between max-information and randomized description length. The primary difference is that Theorem 32
defines a new algorithm rather than arguing about the original algorithm. In addition, the new algorithm
requires samples of $\mathcal{A}(S)$, that is, it needs to know the marginal distribution on $\mathcal{Y}$. As a more concrete
example, Theorem 32 does not allow us to obtain a description-length-based equivalent of Theorem for all
i.i.d. datasets. On the other hand, any algorithm that has bounded max-information for all distributions over
datasets can be converted to an algorithm with low randomized description length.

Theorem 33. Let $\mathcal{A}$ be an algorithm over $\mathcal{X}^n$ with range $\mathcal{Y}$ and let $k = I_\infty(\mathcal{A}, n)$. For any $\beta > 0$, there
exists an algorithm $\mathcal{A}'$ taking as an input a dataset in $\mathcal{X}^n$ such that

1. the output of $\mathcal{A}'$ has randomized description length $k + \log\ln(1/\beta)$;

2. for every dataset $S \in \mathcal{X}^n$, $\Delta(\mathcal{A}'(S), \mathcal{A}(S)) \le \beta$.

Proof. Let $S_0 = (x, x, \ldots, x)$ be an $n$-element dataset for an arbitrary $x \in \mathcal{X}$. By Lemma we know that for
every $S \in \mathcal{X}^n$, $D_\infty(\mathcal{A}(S) \| \mathcal{A}(S_0)) \le k$. We can now apply Lemma 31 with $Z = \mathcal{A}(S_0)$ and $\beta' = \beta$ to obtain
the result. □


The conditions of Theorem 33 are satisfied by any $\epsilon$-differentially private algorithm with $k = \log e \cdot \epsilon n$.
This immediately implies that the output of any $\epsilon$-differentially private algorithm is $\beta$-statistically close to
the output of an algorithm with randomized description length of $\log e \cdot \epsilon n + \log\ln(1/\beta)$. Special cases of
this property have been derived (using a technique similar to Lemma 31) in the context of proving lower
bounds for learning algorithms [BNS13] and communication complexity of differentially private protocols
[MMP+10].

B Answering Queries via Description Length Bounds


In this section, we show a simple method for answering any adaptively chosen sequence of $m$ statistical queries,
using a number of samples that scales only polylogarithmically in $m$. This is an exponential improvement
over what would be possible by naively evaluating the queries exactly on the given samples. Algorithms that
achieve such dependence were given in [DFH+14] and [BSSU15] using differentially private algorithms for
answering queries and the connection between generalization and differential privacy (in the same way as
we do in Section 4.1). Here we give a simpler algorithm which we analyze using description length bounds.
The resulting bounds are comparable to those achieved in [DFH+14] using pure differential privacy but are
somewhat weaker than those achieved using approximate differential privacy [DFH+14, BSSU15, NS15].

The mechanism we give here is based on the Median Mechanism of Roth and Roughgarden [RR10]. A
differentially private variant of this mechanism was introduced in [RR10] to show that it was possible to
answer exponentially many adaptively chosen counting queries (these are queries for an estimate of the
empirical mean of a function $\phi : \mathcal{X} \to [0, 1]$ on the dataset). Here we analyze a noise-free version and establish
its properties via a simple description length-based argument. We remark that it is possible to analogously
define and analyze the noise-free version of the Private Multiplicative Weights Mechanism of Hardt and
Rothblum [?]. This somewhat more involved approach would lead to better (but qualitatively similar) bounds.

Recall that statistical queries are defined by functions $\phi : \mathcal{X} \to [0, 1]$, and our goal is to correctly estimate
their expectation $\mathcal{P}[\phi]$. The Median Mechanism takes as input a dataset $S$ and an adaptively chosen sequence
of such functions $\phi_1, \ldots, \phi_m$, and outputs a sequence of answers $a_1, \ldots, a_m$.


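Algorithm Median Mechanism

Input: An upper bound $m$ on the total number of queries, a dataset $S$ and an accuracy parameter $\tau$

1. Let $\alpha = \tau/3$.

2. Let $\mathrm{Consistent}_0 = \mathcal{X}^{\log m / \alpha^2}$.

3. For a query $\phi_i$ do

(a) Compute $a_i^{\mathrm{pub}} = \mathrm{median}(\{\mathcal{E}_{S'}[\phi_i] : S' \in \mathrm{Consistent}_{i-1}\})$.

(b) Compute $a_i^{\mathrm{priv}} = \mathcal{E}_S[\phi_i]$.

(c) If $|a_i^{\mathrm{pub}} - a_i^{\mathrm{priv}}| \le 2\alpha$ Then:

i. Output $a_i = a_i^{\mathrm{pub}}$.

ii. Let $\mathrm{Consistent}_i = \mathrm{Consistent}_{i-1}$.

(d) Else:

i. Output $a_i = \lfloor a_i^{\mathrm{priv}} \rfloor_\alpha$ (the value $a_i^{\mathrm{priv}}$ rounded to a multiple of $\alpha$).

ii. Let $\mathrm{Consistent}_i = \{S' \in \mathrm{Consistent}_{i-1} : |a_i - \mathcal{E}_{S'}[\phi_i]| \le 2\alpha\}$.


Figure 5: Noise-free version of the Median Mechanism from [RR10]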
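
A toy version of the mechanism in Figure 5 can be written for a small finite universe. The sketch below is illustrative only: it enumerates $\mathrm{Consistent}_0$ explicitly, which is feasible only for tiny universes, it takes the size of the small datasets as a free parameter `small_size` instead of $\log m / \alpha^2$, and it rounds to the nearest multiple of $\alpha$. The class and helper names are assumptions of the example.

```python
import itertools
import statistics


def empirical_mean(phi, dataset):
    return sum(phi(x) for x in dataset) / len(dataset)


def round_to_multiple(value, alpha):
    return round(value / alpha) * alpha


class MedianMechanism:
    def __init__(self, dataset, universe, tau, small_size):
        self.dataset = dataset
        self.alpha = tau / 3.0
        # Consistent_0: all datasets of `small_size` points over the universe.
        self.consistent = list(itertools.product(universe, repeat=small_size))
        self.updates = 0  # number of rounds answered from the real dataset

    def answer(self, phi):
        a_pub = statistics.median(empirical_mean(phi, s) for s in self.consistent)
        a_priv = empirical_mean(phi, self.dataset)
        if abs(a_pub - a_priv) <= 2 * self.alpha:
            return a_pub  # the dataset is not consulted beyond this comparison
        a = round_to_multiple(a_priv, self.alpha)
        # Discard every small dataset that disagrees with the released answer.
        self.consistent = [s for s in self.consistent
                           if abs(a - empirical_mean(phi, s)) <= 2 * self.alpha]
        self.updates += 1
        return a
```

The counter `updates` tracks exactly the rounds in which the released answer differs from the public median, which is the quantity bounded in the analysis that follows.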


The guarantee we get for the Median Mechanism is as follows:

Theorem 34. Let $\beta, \tau > 0$ and $m \ge B > 0$. Let $S$ denote a dataset of size $n$ drawn i.i.d. from a distribution $\mathcal{P}$
over $\mathcal{X}$. Consider any algorithm that adaptively chooses functions $\phi_1, \ldots, \phi_m$ while interacting with the
Median Mechanism which is given $S$ and values $\tau$, $\beta$. For every $i \in [m]$, let $a_i$ denote the answer of the
Median Mechanism on function $\phi_i : \mathcal{X} \to [0, 1]$. Then

$$\mathbb{P}\left[ \exists i \in [m],\ |a_i - \mathcal{P}[\phi_i]| > \tau \right] \le \beta$$

whenever

$$n \ge n_0 = \frac{81 \cdot \log|\mathcal{X}| \cdot \log m \cdot \ln(3m/\tau)}{2\tau^4} + \frac{9 \ln(2m/\beta)}{2\tau^2}.$$

Proof. We begin with a simple lemma which informally states that for every distribution $\mathcal{P}$, and for every set
of $m$ functions $\phi_1, \ldots, \phi_m$, there exists a small dataset that approximately encodes the answers to each of
the corresponding statistical queries.

Lemma 35 ([DR14] Theorem 4.2). For every dataset $S$ over $\mathcal{X}$, any set $C = \{\phi_1, \ldots, \phi_m\}$ of $m$ functions
$\phi_i : \mathcal{X} \to [0, 1]$, and any $\alpha \in [0, 1]$, there exists a data set $S' \in \mathcal{X}^t$ of size $t = \frac{\log m}{\alpha^2}$ such that:

$$\max_{\phi_i \in C} |\mathcal{E}_S[\phi_i] - \mathcal{E}_{S'}[\phi_i]| \le \alpha.$$

Next, we observe that by construction, the Median Mechanism (as presented in Figure 5) always returns
answers that are close to the empirical means of respective query functions.

Lemma 36. For every sequence of queries $\phi_1, \ldots, \phi_m$ and dataset $S$ given to the Median Mechanism, we
have that for every $i$:

$$|a_i - \mathcal{E}_S[\phi_i]| \le 2\alpha.$$

Finally, we give a simple lemma from [RR10] that shows that the Median Mechanism only returns answers
computed using the dataset $S$ in a small number of rounds - for any other round $i$, the answer returned is
computed from the set $\mathrm{Consistent}_{i-1}$.

Lemma 37 ([RR10], see also Chapter 5.2.1 of [DR14]). For every sequence of queries $\phi_1, \ldots, \phi_m$ and a
dataset $S$ given to the Median Mechanism:

$$\left| \{ i \mid a_i \ne a_i^{\mathrm{pub}} \} \right| \le \frac{\log|\mathcal{X}| \log m}{\alpha^2}.$$

Proof. We simply note several facts. First, by construction, $|\mathrm{Consistent}_0| = |\mathcal{X}|^{\log m/\alpha^2}$. Second, by Lemma 35,
for every $i$, $|\mathrm{Consistent}_i| \ge 1$ (because for every set of $m$ queries, there is at least one dataset $S'$ of
size $\log m/\alpha^2$ that is consistent up to error $\alpha$ with $S$ on every query asked, and hence is never removed
from $\mathrm{Consistent}_i$ on any round). Finally, by construction, on any round $i$ such that $a_i \ne a_i^{\mathrm{pub}}$ we have
$|\mathrm{Consistent}_i| \le \frac{1}{2} \cdot |\mathrm{Consistent}_{i-1}|$ (because on any such round, the median dataset $S'$ - and hence at least
half of the datasets in $\mathrm{Consistent}_{i-1}$ - were inconsistent with the answer given, and hence removed). The
lemma follows from the fact that there can therefore be at most $\log\left( |\mathcal{X}|^{\log m/\alpha^2} \right) = \frac{\log|\mathcal{X}| \log m}{\alpha^2}$
many such rounds. □


Our analysis proceeds by viewing the interaction between the data analyst and the Median Mechanism
(Fig. 5) as a single algorithm $\mathcal{A}$. $\mathcal{A}$ takes as input a dataset $S$ and outputs a set of queries and answers
$\mathcal{A}(S) = \{\phi_i, a_i\}_{i=1}^m$. We will show that $\mathcal{A}$'s output has short randomized description length (the data analyst
is a possibly randomized algorithm and hence $\mathcal{A}$ might be randomized).

Lemma 38. Algorithm $\mathcal{A}$ has randomized description length of at most

$$b \le \frac{\log|\mathcal{X}| \cdot \log m}{\alpha^2} \cdot \left( \log m + \log\frac{1}{\alpha} \right)$$

bits.

Proof. We observe that for every fixing of the random bits of the data analyst the entire sequence of queries
asked by the analyst, together with the answers she receives, can be reconstructed from a record of the
indices $i$ of the queries $\phi_i$ such that $a_i \ne a_i^{\mathrm{pub}}$, together with their answers $a_i$ (i.e. it is sufficient to encode
$M := \{(i, a_i) \mid a_i \ne a_i^{\mathrm{pub}}\}$). Once this is established, the lemma follows because by Lemma 37 there are at
most $\frac{\log|\mathcal{X}| \log m}{\alpha^2}$ such queries, and for each one, its index can be encoded with $\log m$ bits, and its answer
with $\log\frac{1}{\alpha}$ bits.

To see why this is so, consider the following procedure for reconstructing the sequence $(\phi_1, a_1, \ldots, \phi_m, a_m)$
of queries asked and answers received. For every fixing of the random bits of the data analyst, her queries
can be expressed as a sequence of functions $(f_1, \ldots, f_m)$ that take as input the queries previously asked to
the Median Mechanism, and the answers previously received, and output the next query to be asked. That is,
we have:

$$f_1() = \phi_1,\quad f_2(\phi_1, a_1) = \phi_2,\quad f_3(\phi_1, a_1, \phi_2, a_2) = \phi_3,\quad \ldots,\quad f_m(\phi_1, a_1, \ldots, \phi_{m-1}, a_{m-1}) = \phi_m.$$

Assume inductively that at stage $i$, the procedure has successfully reconstructed $(\phi_1, a_1, \ldots, \phi_{i-1}, a_{i-1}, \phi_i)$,
and the set $\mathrm{Consistent}_{i-1}$ (this is trivially satisfied at stage $i = 1$). For the inductive case, we need to recover
$a_i$, $\phi_{i+1}$, and $\mathrm{Consistent}_i$. There are two cases we must consider at stage $i$. In the first case, $i$ is such that
$a_i \ne a_i^{\mathrm{pub}}$. But in this case, $(i, a_i) \in M$ by definition, and so we have recovered $a_i$, and we can compute
$\phi_{i+1} = f_{i+1}(\phi_1, a_1, \ldots, \phi_i, a_i)$, and can compute $\mathrm{Consistent}_i = \{S' \in \mathrm{Consistent}_{i-1} : |a_i - \mathcal{E}_{S'}[\phi_i]| \le 2\alpha\}$. In
the other case, $a_i = a_i^{\mathrm{pub}}$. But in this case, by definition of $a_i^{\mathrm{pub}}$ we can compute $a_i = \mathrm{median}(\{\mathcal{E}_{S'}[\phi_i] :
S' \in \mathrm{Consistent}_{i-1}\})$, $\phi_{i+1} = f_{i+1}(\phi_1, a_1, \ldots, \phi_i, a_i)$, and $\mathrm{Consistent}_i = \mathrm{Consistent}_{i-1}$. This completes the
argument - by induction, $M$ is enough to reconstruct the entire query/answer sequence. □

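Finally, we can complete the proof. By Hoeffding's concentration inequality and the union bound we know
that for every sequence of queries $\phi_1, \ldots, \phi_m$ and a dataset $S$ of size $n$ drawn from the distribution $\mathcal{P}^n$:

$$\mathbb{P}\left[ \exists i,\ |\mathcal{E}_S[\phi_i] - \mathcal{P}[\phi_i]| > \alpha \right] \le 2m \cdot \exp(-2n\alpha^2).$$

Applying Theorem 29 to the set $R(\phi_1, a_1, \ldots, \phi_m, a_m) = \{S \mid \exists i,\ |\mathcal{E}_S[\phi_i] - \mathcal{P}[\phi_i]| > \alpha\}$ we obtain that for
the queries $\phi_1, \ldots, \phi_m$ generated on the dataset $S$ and corresponding answers of the Median Mechanism
$a_1, \ldots, a_m$ we have

$$
\begin{aligned}
\mathbb{P}\left[ \exists i,\ |\mathcal{E}_S[\phi_i] - \mathcal{P}[\phi_i]| > \alpha \right] &\le 2^b \cdot 2m \cdot \exp(-2n\alpha^2) \\
&\le 2^{\frac{\log|\mathcal{X}| \cdot \log m}{\alpha^2} \cdot \left( \log m + \log\frac{1}{\alpha} \right)} \cdot 2m \cdot \exp(-2n\alpha^2).
\end{aligned}
$$

Solving, we have that whenever:

$$n \ge \frac{\log|\mathcal{X}| \cdot \log m \cdot \ln(m/\alpha)}{2\alpha^4} + \frac{\ln(2m/\beta)}{2\alpha^2}$$

we have: $\mathbb{P}\left[ \exists i,\ |\mathcal{E}_S[\phi_i] - \mathcal{P}[\phi_i]| > \alpha \right] \le \beta$. Combining this with Lemma 36 we have:

$$\mathbb{P}\left[ \exists i \in [m],\ |a_i - \mathcal{P}[\phi_i]| > 3\alpha \right] \le \beta.$$

Plugging in $\tau = 3\alpha$ gives the theorem.

□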
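
The sample-size requirement of Theorem 34 is easy to evaluate numerically. The sketch below assumes that $\log$ denotes the base-2 logarithm (description length measured in bits) and $\ln$ the natural logarithm; the function names are illustrative. The second function returns the natural logarithm of the failure bound $2^b \cdot 2m \cdot \exp(-2n\alpha^2)$ so that very large exponents do not overflow.

```python
import math


def median_mechanism_sample_size(universe_size, m, tau, beta):
    """n_0 from Theorem 34, written in terms of alpha = tau / 3."""
    alpha = tau / 3.0
    description_term = (math.log2(universe_size) * math.log2(m)
                        * math.log(m / alpha) / (2 * alpha ** 4))
    concentration_term = math.log(2 * m / beta) / (2 * alpha ** 2)
    return description_term + concentration_term


def log_failure_bound(n, universe_size, m, alpha):
    """ln( 2^b * 2m * exp(-2 n alpha^2) ) with b as in Lemma 38."""
    b = (math.log2(universe_size) * math.log2(m) / alpha ** 2
         * (math.log2(m) + math.log2(1.0 / alpha)))
    return b * math.log(2) + math.log(2 * m) - 2 * n * alpha ** 2
```

At $n = n_0$ the failure bound equals $\beta$ up to floating-point error, and the first term of $n_0$ dominates: the dependence on the number of queries $m$ is only polylogarithmic, while the dependence on the accuracy is $1/\tau^4$.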